ACCELERATING IDENTIFICATION OF SINGLE NUCLEOTIDE
POLYMORPHISMS AND ALIGNMENT OF CLONES IN GENOMIC
SEQUENCING

This application claims the benefit of U.S. Provisional Patent
Application Serial No. 60/114,881, filed January 6, 1999.

The present invention was made with funding from National Institutes
of Health Grant No. GM38839. The United States Government may have certain
rights in this invention.


FIELD OF THE INVENTION 



The present invention is directed to accelerating identification of single
nucleotide polymorphisms and alignment of clones in genomic sequencing.


BACKGROUND OF THE INVENTION 

Introduction to Applications of SNPs

Accumulation of genetic changes affecting cell cycle control, cell
differentiation, apoptosis, and DNA replication and repair leads to carcinogenesis
(Bishop, J. M., "Molecular Themes In Oncogenesis," Cell, 64(2):235-48 (1991)).
DNA alterations include large deletions which inactivate tumor suppressor genes,
amplification to increase expression of oncogenes, and most commonly single
nucleotide mutations or polymorphisms which impair gene expression or gene
function or predispose an individual to further genomic instability (Table 1).

Table 1: Genetic Alterations Commonly Found in the Human Genome

Type of Alteration         Possible Causes of Alteration     Possible Consequences of Alteration   Detection of Alteration

Single nucleotide          Inherited variation               Silent: does not alter function       DNA sequencing
polymorphism               Methylation                       Missense: alters gene function        SSCP, DGGE, CDGE
                           Defective repair genes                                                  Protein truncation
                                                                                                   Mismatch cleavage

Microsatellite             Defective DNA repair genes        Frameshift: truncates gene            Microsatellite analysis
instability (MIN)

Large deletions            Defective DNA repair genes        Loss of gene function                 Loss of heterozygosity
                           Defective DNA replication genes                                         CGH
                           Illegitimate recombination                                              SNP analysis
                           Double strand break

DNA amplifications         Defective DNA repair genes        Overexpression of gene                Competitive PCR
                           Defective DNA replication genes                                         CGH
                           Illegitimate recombination                                              SNP analysis

Others:                    Defective methylase genes         Gene silencing or overexpression;     Endonuclease digestion
Methylation,               Double strand break               creation of chimeric protein          PCR, FISH
Translocation


Rapid detection of germline mutations in individuals at risk and accurate
characterization of genetic changes in individual tumors would provide opportunities
to improve early detection, prevention, prognosis, and specific treatment. However,
genetic detection poses the problem of identifying a predisposing polymorphism in
the germline or an index mutation in a pre-malignant lesion or early cancer that may
be present at many potential sites in many genes. Furthermore, quantification of allele
copy number is necessary to detect gene amplification and deletion. Therefore,
technologies are urgently needed that can rapidly detect mutation, allele deletion, and
allele amplification in multiple genes. Single nucleotide polymorphisms ("SNPs") are
potentially powerful genetic markers for early detection, diagnosis, and staging of
human cancers.

Identification of DNA sequence polymorphisms is the cornerstone of
modern genome mapping. Initially, maps were created using RFLP markers
(Botstein, D., et al., "Construction Of A Genetic Linkage Map In Man Using
Restriction Fragment Length Polymorphisms," Amer. J. Hum. Genet., 32:314-331
(1980)), and later by the more polymorphic dinucleotide repeat sequences (Weber, J.
L., et al., "Abundant Class Of Human DNA Polymorphisms Which Can Be Typed
Using The Polymerase Chain Reaction," Amer. J. Hum. Genet., 44:388-396 (1989)
and Reed, P. W., et al., "Chromosome-Specific Microsatellite Sets For
Fluorescence-Based, Semi-Automated Genome Mapping," Nat Genet, 7(3):390-5 (1994)). Such
sequence polymorphisms may also be used to detect inactivation of tumor suppressor
genes via loss of heterozygosity (LOH) and activation of oncogenes via amplification.
These genomic changes are currently being analyzed using conventional Southern
hybridizations, competitive PCR, real-time PCR, microsatellite marker analysis, and
comparative genome hybridization (CGH) (Ried, T., et al., "Comparative Genomic
Hybridization Reveals A Specific Pattern Of Chromosomal Gains And Losses During
The Genesis Of Colorectal Tumors," Genes, Chromosomes & Cancer, 15(4):234-45 (1996),
Kallioniemi, et al., "ERBB2 Amplification In Breast Cancer Analyzed By
Fluorescence In Situ Hybridization," Proc Natl Acad Sci USA, 89(12):5321-5
(1992), Kallioniemi, et al., "Comparative Genomic Hybridization: A Rapid New
Method For Detecting And Mapping DNA Amplification In Tumors," Semin Cancer
Biol, 4(1):41-6 (1993), Kallioniemi, et al., "Detection And Mapping Of Amplified
DNA Sequences In Breast Cancer By Comparative Genomic Hybridization," Proc
Natl Acad Sci USA, 91(6):2156-60 (1994), Kallioniemi, et al., "Identification Of
Gains And Losses Of DNA Sequences In Primary Bladder Cancer By Comparative
Genomic Hybridization," Genes Chromosom Cancer, 12(3):213-9 (1995), Schwab,
M., et al., "Amplified DNA With Limited Homology To Myc Cellular Oncogene Is
Shared By Human Neuroblastoma Cell Lines And A Neuroblastoma Tumour,"
Nature, 305(5931):245-8 (1983), Solomon, E., et al., "Chromosome 5 Allele Loss In
Human Colorectal Carcinomas," Nature, 328(6131):616-9 (1987), Law, D. J., et al.,
"Concerted Nonsyntenic Allelic Loss In Human Colorectal Carcinoma," Science,
241(4868):961-5 (1988), Frye, R. A., et al., "Detection Of Amplified Oncogenes By
Differential Polymerase Chain Reaction," Oncogene, 4(9):1153-7 (1989), Neubauer,
A., et al., "Analysis Of Gene Amplification In Archival Tissue By Differential
Polymerase Chain Reaction," Oncogene, 7(5):1019-25 (1992), Chiang, P. W., et al.,
"Use Of A Fluorescent-PCR Reaction To Detect Genomic Sequence Copy Number
And Transcriptional Abundance," Genome Research, 6(10):1013-26 (1996), Heid, C.
A., et al., "Real Time Quantitative PCR," Genome Research, 6(10):986-94 (1996),
Lee, H. H., et al., "Rapid Detection Of Trisomy 21 By Homologous Gene
Quantitative PCR (HGQ-PCR)," Human Genetics, 99(3):364-7 (1997), Boland, C. R.,
et al., "Microallelotyping Defines The Sequence And Tempo Of Allelic Losses At
Tumour Suppressor Gene Loci During Colorectal Cancer Progression," Nature
Medicine, 1(9):902-9 (1995), Cawkwell, L., et al., "Frequency Of Allele Loss Of
DCC, p53, RB1, WT1, NF1, NM23 And APC/MCC In Colorectal Cancer Assayed By
Fluorescent Multiplex Polymerase Chain Reaction," Br J Cancer, 70(5):813-8 (1994),
and Hampton, G. M., et al., "Simultaneous Assessment Of Loss Of Heterozygosity At
Multiple Microsatellite Loci Using Semi-Automated Fluorescence-Based Detection:
Subregional Mapping Of Chromosome 4 In Cervical Carcinoma," Proc. Nat'l. Acad.
Sci. USA, 93(13):6704-9 (1996)). Competitive and real-time PCR are considerably
faster and require less material than Southern hybridization, although neither
technique is amenable to multiplexing. Current multiplex microsatellite marker
approaches require careful attention to primer concentrations and amplification
conditions. While PCR products may be pooled in sets, this requires an initial run
on agarose gels to approximate the amount of DNA in each band (Reed, P. W., et al.,
"Chromosome-Specific Microsatellite Sets For Fluorescence-Based, Semi-Automated
Genome Mapping," Nat Genet, 7(3):390-5 (1994), and Hampton, G. M., et al.,
"Simultaneous Assessment Of Loss Of Heterozygosity At Multiple Microsatellite Loci
Using Semi-Automated Fluorescence-Based Detection: Subregional Mapping Of
Chromosome 4 In Cervical Carcinoma," Proc. Nat'l. Acad. Sci. USA, 93(13):6704-9
(1996)). CGH provides a global assessment of LOH and amplification, but with a
resolution range of about 20 Mb. To improve gene mapping and discovery, new
techniques are urgently needed to allow for simultaneous detection of multiple
genetic alterations.

Amplified fragment length polymorphism ("AFLP") technology is a
powerful DNA fingerprinting technique originally developed to identify plant
polymorphisms in genomic DNA. It is based on the selective amplification of
restriction fragments from a total digest of genomic DNA.

The original technique involved three steps: (1) restriction of the
genomic DNA, i.e. with EcoRI and MseI, and ligation of oligonucleotide adapters,
(2) selective amplification of a subset of all the fragments in the total digest using
primers which reach into the fragment by 1 to 3 bases, and (3) gel-based analysis
of the amplified fragments. Janssen, et al., "Evaluation of the DNA Fingerprinting
Method AFLP as a New Tool in Bacterial Taxonomy," Microbiology, 142(Pt 7):1881-93
(1996); Thomas, et al., "Identification of Amplified Restriction Fragment
Polymorphism (AFLP) Markers Tightly Linked to the Tomato Cf-9 Gene for
Resistance to Cladosporium fulvum," Plant J, 8(5):785-94 (1995); Vos, et al.,
"AFLP: A New Technique for DNA Fingerprinting," Nucleic Acids Res,
23(21):4407-14 (1995); Bachem, et al., "Visualization of Differential Gene
Expression Using a Novel Method of RNA Fingerprinting Based on AFLP: Analysis
of Gene Expression During Potato Tuber Development," Plant J, 9(5):745-53 (1996);
and Meksem, et al., "A High-Resolution Map of the Vicinity of the R1 Locus on
Chromosome V of Potato Based on RFLP and AFLP Markers," Mol Gen Genet,
249(1):74-81 (1995), which are hereby incorporated by reference.
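
The effect of the selective bases in step (2) can be illustrated with a short calculation. The following is a minimal sketch, assuming the four bases are equally frequent and independent at the positions next to each restriction site (an idealization not stated in the text); the function name `selected_fraction` is hypothetical.

```python
def selected_fraction(bases_primer_1: int, bases_primer_2: int) -> float:
    """Expected fraction of adapter-ligated fragments that a primer pair amplifies.

    Assumes each selective base matches a random fragment with probability 1/4.
    """
    if bases_primer_1 < 0 or bases_primer_2 < 0:
        raise ValueError("number of selective bases cannot be negative")
    return 0.25 ** (bases_primer_1 + bases_primer_2)


for n in (1, 2, 3):
    # Same number of selective bases on both primers.
    print(n, selected_fraction(n, n))
```

Under this idealized model, each added selective base reduces the amplified subset roughly four-fold, which is why a few bases suffice to bring a total digest down to a number of fragments that can be resolved on a gel.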

AFLP differs substantially from the present invention because it:
(i) uses palindromic enzymes, (ii) amplifies both desired EcoRI-MseI as well as
unwanted MseI-MseI fragments, and (iii) does not identify both alleles when a SNP
destroys a pre-existing restriction site. Further, AFLP does not identify SNPs which
are outside restriction sites. AFLP does not, and was not designed to, create a map of
a genome.

Representational Difference Analysis (RDA) was developed by N.
Lisitsyn and M. Wigler to isolate the differences between two genomes (Lisitsyn, et
al., "Cloning the Differences Between Two Complex Genomes," Science, 259:946-951
(1993); Lisitsyn, et al., "Direct Isolation of Polymorphic Markers Linked to a
Trait by Genetically Directed Representational Difference Analysis," Nat Genet,
6(1):57-63 (1994); Lisitsyn, et al., "Comparative Genomic Analysis of Tumors:
Detection of DNA Losses and Amplification," Proc Natl Acad Sci USA, 92(1):151-5
(1995); Thiagalingam, et al., "Evaluation of the FHIT Gene in Colorectal Cancers,"
Cancer Res, 56(13):2936-9 (1996); Li, et al., "PTEN, a Putative Protein Tyrosine
Phosphatase Gene Mutated in Human Brain, Breast, and Prostate Cancer," Science,
275(5308):1943-7 (1997); and Schutte, et al., "Identification by Representational
Difference Analysis of a Homozygous Deletion in Pancreatic Carcinoma That Lies
Within the BRCA2 Region," Proc Natl Acad Sci USA, 92(13):5950-4 (1995)). A
system was developed in which subtractive and kinetic enrichment was used to purify
restriction endonuclease fragments present in one DNA sample, but not in another.
The representational part is required to reduce the complexity of the DNA and
generates "amplicons". This allows isolation of probes that detect viral sequences in
human DNA, polymorphisms, loss of heterozygosities, gene amplifications, and
genome rearrangements.

The principle is to subtract "tester" amplicons from an excess of
"driver" amplicons. When the tester DNA is tumor DNA and the driver is normal
DNA, one isolates gene amplifications. When the tester DNA is normal DNA and the
driver is tumor DNA, one isolates genes which lose function (i.e. tumor suppressor
genes).

A brief outline of the procedure is provided herein: (i) cleave both
tester and driver DNA with the same restriction endonuclease, (ii) ligate
unphosphorylated adapters to tester DNA, (iii) mix a 10-fold excess of driver to tester
DNA, melt and hybridize, (iv) fill in ends, (v) add primer and PCR amplify,
(vi) digest ssDNA with mung bean nuclease, (vii) PCR amplify, (viii) repeat steps (i)
to (vii) for 2-3 rounds, (ix) clone fragments and sequence.

RDA differs substantially from the present invention because it: (i) is
a very complex procedure, (ii) is used to identify only a few differences between a
tester and driver sample, and (iii) does not identify both alleles when a SNP destroys a
pre-existing restriction site. Further, RDA does not identify SNPs which are outside
restriction sites. RDA does not, and was not designed to, create a map of a genome.

The advent of DNA arrays has resulted in a paradigm shift in detecting
vast numbers of sequence variations and gene expression levels on a genomic scale
(Pease, A. C., et al., "Light-Generated Oligonucleotide Arrays For Rapid DNA
Sequence Analysis," Proc Natl Acad Sci USA, 91(11):5022-6 (1994), Lipshutz, R.
J., et al., "Using Oligonucleotide Probe Arrays To Access Genetic Diversity,"
Biotechniques, 19(3):442-7 (1995), Eggers, M., et al., "A Microchip For Quantitative
Detection Of Molecules Utilizing Luminescent And Radioisotope Reporter Groups,"
Biotechniques, 17(3):516-25 (1994), Guo, Z., et al., "Direct Fluorescence Analysis Of
Genetic Polymorphisms By Hybridization With Oligonucleotide Arrays On Glass
Supports," Nucleic Acids Res, 22(24):5456-65 (1994), Beattie, K. L., et al.,
"Advances In Genosensor Research," Clinical Chemistry, 41(5):700-6 (1995), Hacia,
J. G., et al., "Detection Of Heterozygous Mutations In BRCA1 Using High Density
Oligonucleotide Arrays And Two-Colour Fluorescence Analysis," Nature Genetics,
WO 00/40755                                                    PCT/US00/00144



14(4):441-7 (1996), Chee, M., et al., "Accessing Genetic Information With
High-Density DNA Arrays," Science, 274(5287):610-4 (1996), Cronin, M. T., et al.,
"Cystic Fibrosis Mutation Detection By Hybridization To Light-Generated DNA
Probe Arrays," Hum Mutat, 7(3):244-55 (1996), Drobyshev, A., et al., "Sequence
Analysis By Hybridization With Oligonucleotide Microchip: Identification Of
Beta-Thalassemia Mutations," Gene, 188(1):45-52 (1997), Kozal, M. J., et al., "Extensive
Polymorphisms Observed In HIV-1 Clade B Protease Gene Using High-Density
Oligonucleotide Arrays," Nature Medicine, 2(7):753-9 (1996), Yershov, G., et al.,
"DNA Analysis And Diagnostics On Oligonucleotide Microchips," Proc Natl Acad
Sci USA, 93(10):4913-8 (1996), DeRisi, J., et al., "Use Of A cDNA Microarray To
Analyse Gene Expression Patterns In Human Cancer," Nature Genetics, 14(4):457-60
(1996), Schena, M., et al., "Parallel Human Genome Analysis: Microarray-Based
Expression Monitoring Of 1000 Genes," Proc. Nat'l. Acad. Sci. USA, 93(20):10614-9
(1996), Shalon, D., et al., "A DNA Microarray System For Analyzing Complex DNA
Samples Using Two-Color Fluorescent Probe Hybridization," Genome Research,
6(7):639-45 (1996)). Determining deletions, amplifications, and mutations at the
DNA level will complement the information obtained from expression profiling of
tumors (DeRisi, J., et al., "Use Of A cDNA Microarray To Analyse Gene Expression
Patterns In Human Cancer," Nature Genetics, 14(4):457-60 (1996), and Zhang, L., et
al., "Gene Expression Profiles In Normal And Cancer Cells," Science, 276:1268-1272
(1997)). DNA chips designed to distinguish single nucleotide differences are
generally based on the principle of "sequencing by hybridization" (Lipshutz, R. J., et
al., "Using Oligonucleotide Probe Arrays To Access Genetic Diversity,"
Biotechniques, 19(3):442-7 (1995), Eggers, M., et al., "A Microchip For Quantitative
Detection Of Molecules Utilizing Luminescent And Radioisotope Reporter Groups,"
Biotechniques, 17(3):516-25 (1994), Guo, Z., et al., "Direct Fluorescence Analysis Of
Genetic Polymorphisms By Hybridization With Oligonucleotide Arrays On Glass
Supports," Nucleic Acids Res, 22(24):5456-65 (1994), Beattie, K. L., et al.,
"Advances In Genosensor Research," Clinical Chemistry, 41(5):700-6 (1995), Hacia,
J. G., et al., "Detection Of Heterozygous Mutations In BRCA1 Using High Density
Oligonucleotide Arrays And Two-Colour Fluorescence Analysis," Nature Genetics,
14(4):441-7 (1996), Chee, M., et al., "Accessing Genetic Information With High-
Density DNA Arrays," Science, 274(5287):610-4 (1996), Cronin, M. T., et al.,
"Cystic Fibrosis Mutation Detection By Hybridization To Light-Generated DNA
Probe Arrays," Hum Mutat, 7(3):244-55 (1996), Drobyshev, A., et al., "Sequence
Analysis By Hybridization With Oligonucleotide Microchip: Identification Of
Beta-Thalassemia Mutations," Gene, 188(1):45-52 (1997), Kozal, M. J., et al., "Extensive
Polymorphisms Observed In HIV-1 Clade B Protease Gene Using High-Density
Oligonucleotide Arrays," Nature Medicine, 2(7):753-9 (1996), and Yershov, G., et al.,
"DNA Analysis And Diagnostics On Oligonucleotide Microchips," Proc Natl Acad
Sci USA, 93(10):4913-8 (1996)), or polymerase extension of arrayed primers
(Nikiforov, T. T., et al., "Genetic Bit Analysis: A Solid Phase Method For Typing
Single Nucleotide Polymorphisms," Nucleic Acids Research, 22(20):4167-75 (1994),
Shumaker, J. M., et al., "Mutation Detection By Solid Phase Primer Extension,"
Human Mutation, 7(4):346-54 (1996), Pastinen, T., et al., "Minisequencing: A
Specific Tool For DNA Analysis And Diagnostics On Oligonucleotide Arrays,"
Genome Research, 7(6):606-14 (1997), and Lockley, A. K., et al., "Colorimetric
Detection Of Immobilised PCR Products Generated On A Solid Support," Nucleic
Acids Research, 25(6):1313-4 (1997) (See Table 2)). While DNA chips can confirm
a known sequence, similar hybridization profiles create ambiguities in distinguishing
heterozygous from homozygous alleles (Eggers, M., et al., "A Microchip For
Quantitative Detection Of Molecules Utilizing Luminescent And Radioisotope
Reporter Groups," Biotechniques, 17(3):516-25 (1994), Beattie, K. L., et al.,
"Advances In Genosensor Research," Clinical Chemistry, 41(5):700-6 (1995), Chee,
M., et al., "Accessing Genetic Information With High-Density DNA Arrays,"
Science, 274(5287):610-4 (1996), Kozal, M. J., et al., "Extensive Polymorphisms
Observed In HIV-1 Clade B Protease Gene Using High-Density Oligonucleotide
Arrays," Nature Medicine, 2(7):753-9 (1996), and Southern, E. M., "DNA Chips:
Analysing Sequence By Hybridization To Oligonucleotides On A Large Scale,"
Trends in Genetics, 12(3):110-5 (1996)). Attempts to overcome this problem include
using two-color fluorescence analysis (Hacia, J. G., et al., "Detection Of
Heterozygous Mutations In BRCA1 Using High Density Oligonucleotide Arrays And
Two-Colour Fluorescence Analysis," Nature Genetics, 14(4):441-7 (1996)), 40
overlapping addresses for each known polymorphism (Cronin, M. T., et al., "Cystic
Fibrosis Mutation Detection By Hybridization To Light-Generated DNA Probe
Arrays," Hum Mutat, 7(3):244-55 (1996)), nucleotide analogues in the array sequence
(Guo, Z., et al., "Enhanced Discrimination Of Single Nucleotide Polymorphisms By
Artificial Mismatch Hybridization," Nature Biotech., 15:331-335 (1997)), or adjacent
co-hybridized oligonucleotides (Drobyshev, A., et al., "Sequence Analysis By
Hybridization With Oligonucleotide Microchip: Identification Of Beta-Thalassemia
Mutations," Gene, 188(1):45-52 (1997) and Yershov, G., et al., "DNA Analysis And
Diagnostics On Oligonucleotide Microchips," Proc Natl Acad Sci USA, 93(10):4913-8
(1996)). In a side-by-side comparison, nucleotide discrimination using the
hybridization chips fared an order of magnitude worse than using primer extension
(Pastinen, T., et al., "Minisequencing: A Specific Tool For DNA Analysis And
Diagnostics On Oligonucleotide Arrays," Genome Research, 7(6):606-14 (1997)).
Nevertheless, solid phase primer extension also generates false positive signals from
mononucleotide repeat sequences, template-dependent errors, and
template-independent errors (Nikiforov, T. T., et al., "Genetic Bit Analysis: A Solid Phase
Method For Typing Single Nucleotide Polymorphisms," Nucl. Acids Res.,
22(20):4167-75 (1994) and Shumaker, J. M., et al., "Mutation Detection By Solid
Phase Primer Extension," Human Mutation, 7(4):346-54 (1996)).

Over the past few years, an alternate strategy in DNA array design has
been pursued. Combined with a solution-based polymerase chain reaction/ligase
detection assay (PCR/LDR), this array allows for accurate quantification of each SNP
allele (See Table 2).

Table 2: Comparison of high-throughput techniques to quantify known SNPs in clinical samples.

Technique            Advantages                                            Disadvantages

Hybridization on     1) High density: up to 135,000 addresses.             1) Specificity determined by hybridization:
DNA array            2) Scan for SNPs in thousands of loci.                   - difficult to distinguish all SNPs.
                     3) Detects small insertions/deletions.                   - difficult to quantify allelic imbalance.
                                                                           2) Each new DNA target requires a new array.

Mini-sequencing      1) Uses high fidelity polymerase extension:           1) Cannot detect small insertions/deletions.
(SNuPE) on              - minimizes false positive signal.                 2) Each new DNA target requires a new array.
DNA array            2) Potential for single-tube assay.

PCR/LDR with         1) Uses high fidelity thermostable ligase:            1) Requires synthesis of many ligation primers.
zip-code capture        - minimizes false positive signal.
on universal         2) Separates SNP identification from signal capture;
DNA array               avoids problems of false hybridization.
                     3) Quantify gene amplifications and deletions.
                     4) Universal array works for all gene targets.

For high throughput detection of specific multiplexed LDR products, unique
addressable array-specific sequences on the LDR probes guide each LDR product to a
designated address on a DNA array, analogous to molecular tags developed for
bacterial and yeast genetics (Hensel, M., et al., "Simultaneous Identification
Of Bacterial Virulence Genes By Negative Selection," Science, 269(5222):400-3
(1995) and Shoemaker, D., et al., "Quantitative Phenotypic Analysis Of Yeast
Deletion Mutants Using A Highly Parallel Molecular Bar-Coding Strategy," Nat
Genet, 14(4):450-6 (1996)). The specificity of this reaction is determined by a
thermostable ligase which allows detection of (i) dozens to hundreds of
polymorphisms in a single-tube multiplex format, (ii) small insertions and deletions in
repeat sequences, and (iii) low level polymorphisms in a background of normal DNA.
By uncoupling polymorphism identification from hybridization, each step may be
optimized independently, thus allowing for quantitative assessment of allele
imbalance even in the presence of stromal cell contamination. This approach has the
potential to rapidly identify multiple gene deletions and amplifications associated with
tumor progression, as well as lead to the discovery of new oncogenes and tumor
suppressor genes. Further, the ability to score hundreds to thousands of SNPs has
utility in linkage studies (Nickerson, D. A., et al., "Identification Of Clusters Of
Biallelic Polymorphic Sequence-Tagged Sites (pSTSs) That Generate Highly
Informative And Automatable Markers For Genetic Linkage Mapping," Genomics,
12(2):377-87 (1992), Lin, Z., et al., "Multiplex Genotype Determination At A Large
Number Of Gene Loci," Proc Natl Acad Sci USA, 93(6):2582-7 (1996), Fanning, G.
C., et al., "Polymerase Chain Reaction Haplotyping Using 3' Mismatches In The
Forward And Reverse Primers: Application To The Biallelic Polymorphisms Of
Tumor Necrosis Factor And Lymphotoxin Alpha," Tissue Antigens, 50(1):23-31
(1997), and Kruglyak, L., "The Use of a Genetic Map of Biallelic Markers in Linkage
Studies," Nature Genetics, 17:21-24 (1997)), human identification (Delahunty, C., et
al., "Testing The Feasibility Of DNA Typing For Human Identification By PCR And
An Oligonucleotide Ligation Assay," Am. J. Hum. Gen., 58(6):1239-46 (1996) and
Belgrader, P., et al., "A Multiplex PCR-Ligase Detection Reaction Assay For Human
Identity Testing," Gen. Sci. & Tech., 1:77-87 (1996)), and mapping complex human
diseases using association studies where SNPs are identical by descent (Collins, F. S.,
"Positional Cloning Moves From Perditional To Traditional," Nat Genet, 9(4):347-50
(1995), Lander, E. S., "The New Genomics: Global Views Of Biology," Science,
274(5287):536-9 (1996), Risch, N., et al., "The Future Of Genetic Studies Of Complex
Human Diseases," Science, 273(5281):1516-7 (1996), Cheung, V. G., et al., "Genomic
Mismatch Scanning Identifies Human Genomic DNA Shared Identical By Descent,"
Genomics, 47(1):1-6 (1998), Cheung, V. G., et al., "Linkage-Disequilibrium Mapping
Without Genotyping," Nat Genet, 18(3):225-230 (1998), and McAllister, L., et al.,
"Enrichment For Loci Identical-By-Descent Between Pairs Of Mouse Or Human
Genomes By Genomic Mismatch Scanning," Genomics, 47(1):7-11 (1998)).

For 85% of epithelial cancers, loss of heterozygosity and gene
amplification are the most frequently observed changes which inactivate the tumor
suppressor genes and activate the oncogenes. Southern hybridizations, competitive
PCR, real time PCR, microsatellite marker analysis, and comparative genome
hybridization (CGH) have all been used to quantify changes in chromosome copy
number (Ried, T., et al., "Comparative Genomic Hybridization Reveals A Specific
Pattern Of Chromosomal Gains And Losses During The Genesis Of Colorectal
Tumors," Genes, Chromosomes & Cancer, 15(4):234-45 (1996), Kallioniemi, et al.,
"ERBB2 Amplification In Breast Cancer Analyzed By Fluorescence In Situ
Hybridization," Proc Natl Acad Sci USA, 89(12):5321-5 (1992), Kallioniemi, et al.,
"Comparative Genomic Hybridization: A Rapid New Method For Detecting And
Mapping DNA Amplification In Tumors," Semin Cancer Biol, 4(1):41-6 (1993),
Kallioniemi, et al., "Detection And Mapping Of Amplified DNA Sequences In Breast
Cancer By Comparative Genomic Hybridization," Proc Natl Acad Sci USA,
91(6):2156-60 (1994), Kallioniemi, et al., "Identification Of Gains And Losses Of
DNA Sequences In Primary Bladder Cancer By Comparative Genomic
Hybridization," Genes Chromosom Cancer, 12(3):213-9 (1995), Schwab, M., et al.,
"Amplified DNA With Limited Homology To Myc Cellular Oncogene Is Shared By
Human Neuroblastoma Cell Lines And A Neuroblastoma Tumour," Nature,
305(5931):245-8 (1983), Solomon, E., et al., "Chromosome 5 Allele Loss In Human
Colorectal Carcinomas," Nature, 328(6131):616-9 (1987), Law, D. J., et al.,
"Concerted Nonsyntenic Allelic Loss In Human Colorectal Carcinoma," Science,
241(4868):961-5 (1988), Frye, R. A., et al., "Detection Of Amplified Oncogenes By
Differential Polymerase Chain Reaction," Oncogene, 4(9):1153-7 (1989), Neubauer,
A., et al., "Analysis Of Gene Amplification In Archival Tissue By Differential
Polymerase Chain Reaction," Oncogene, 7(5):1019-25 (1992), Chiang, P. W., et al.,
"Use Of A Fluorescent-PCR Reaction To Detect Genomic Sequence Copy Number
And Transcriptional Abundance," Genome Research, 6(10):1013-26 (1996), Heid, C.
A., et al., "Real Time Quantitative PCR," Genome Research, 6(10):986-94 (1996),
Lee, H. H., et al., "Rapid Detection Of Trisomy 21 By Homologous Gene
Quantitative PCR (HGQ-PCR)," Human Genetics, 99(3):364-7 (1997), Boland, C. R.,
et al., "Microallelotyping Defines The Sequence And Tempo Of Allelic Losses At
Tumour Suppressor Gene Loci During Colorectal Cancer Progression," Nature
Medicine, 1(9):902-9 (1995), Cawkwell, L., et al., "Frequency Of Allele Loss Of
DCC, p53, RB1, WT1, NF1, NM23 And APC/MCC In Colorectal Cancer Assayed By
Fluorescent Multiplex Polymerase Chain Reaction," Br J Cancer, 70(5):813-8 (1994),
and Hampton, G. M., et al., "Simultaneous Assessment Of Loss Of Heterozygosity At
Multiple Microsatellite Loci Using Semi-Automated Fluorescence-Based Detection:
Subregional Mapping Of Chromosome 4 In Cervical Carcinoma," Proc. Nat'l. Acad.
Sci. USA, 93(13):6704-9 (1996)). Recently, a microarray of consecutive BACs from
the long arm of chromosome 20 has been used to accurately quantify 5 regions of
amplification and one region of LOH associated with development of breast cancer.
This area was previously thought to contain only 3 regions of amplification (Tanner,
M., et al., "Independent Amplification And Frequent Co-Amplification Of Three
Nonsyntenic Regions On The Long Arm Of Chromosome 20 In Human Breast
Cancer," Cancer Research, 56(15):3441-5 (1996)). Although this approach will yield
valuable information from cell lines, it is not clear it will prove quantitative when
starting with microdissected tissue which requires PCR amplification. Competitive
and real time PCR approaches require careful optimization to detect 2-fold differences
(Frye, R. A., et al., "Detection Of Amplified Oncogenes By Differential Polymerase
Chain Reaction," Oncogene, 4(9):1153-7 (1989), Neubauer, A., et al., "Analysis Of
Gene Amplification In Archival Tissue By Differential Polymerase Chain Reaction,"
Oncogene, 7(5):1019-25 (1992), Chiang, P. W., et al., "Use Of A Fluorescent-PCR
Reaction To Detect Genomic Sequence Copy Number And Transcriptional
Abundance," Genome Research, 6(10):1013-26 (1996), Heid, C. A., et al., "Real
Time Quantitative PCR," Genome Research, 6(10):986-94 (1996), and Lee, H. H., et
al., "Rapid Detection Of Trisomy 21 By Homologous Gene Quantitative PCR
(HGQ-PCR)," Human Genetics, 99(3):364-7 (1997)). Unfortunately, stromal contamination
may reduce the ratio between tumor and normal chromosome copy number to less
than 2-fold. By using a quantitative SNP-DNA array detection, each allele can be
distinguished independently, thus reducing the effect of stromal contamination in half.
Further, by comparing the ratio of allele-specific LDR product formed from a tumor to
control gene between a tumor and normal sample, it may be possible to distinguish
gene amplification from loss of heterozygosity at multiple loci in a single reaction.
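
The arithmetic behind the stromal-contamination argument can be made concrete. The sketch below assumes a simple mixing model (signal proportional to copy number, with tumor and stromal cells contributing in proportion to their share of the sample); the model, the function names, and the 50% tumor fraction are illustrative assumptions, not values from the text.

```python
def total_copy_ratio(tumor_fraction: float, tumor_copies: int,
                     normal_copies: int = 2) -> float:
    """Locus signal in a mixed sample relative to pure normal tissue (alleles pooled)."""
    mixed = tumor_fraction * tumor_copies + (1.0 - tumor_fraction) * normal_copies
    return mixed / normal_copies


def allele_ratio(tumor_fraction: float, tumor_allele_copies: int,
                 normal_allele_copies: int = 1) -> float:
    """Signal for one allele in a mixed sample relative to pure normal tissue."""
    mixed = (tumor_fraction * tumor_allele_copies
             + (1.0 - tumor_fraction) * normal_allele_copies)
    return mixed / normal_allele_copies


# Loss of one allele in the tumor (1 of 2 copies kept); sample is half stroma.
f = 0.5
print("pooled alleles:", total_copy_ratio(f, tumor_copies=1))
print("lost allele:   ", allele_ratio(f, tumor_allele_copies=0))
print("kept allele:   ", allele_ratio(f, tumor_allele_copies=1))
```

Under these assumptions the pooled signal falls only to 0.75 of normal, while the signal for the lost allele falls to 0.5, so scoring each allele separately doubles the measurable change for the same amount of stromal contamination.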

Using PCR/LDR to Detect SNPs

The ligase detection reaction ("LDR") is ideal for multiplexed
discrimination of single-base mutations or polymorphisms (Barany, F., et al.,
"Cloning, Overexpression, And Nucleotide Sequence Of A Thermostable DNA
Ligase Gene," Gene, 109:1-11 (1991), Barany, F., "Genetic Disease Detection And
DNA Amplification Using Cloned Thermostable Ligase," Proc. Natl. Acad. Sci. USA,
88:189-193 (1991), and Barany, F., "The Ligase Chain Reaction (LCR) In A PCR
World," PCR Methods and Applications, 1:5-16 (1991)). Since there is no
polymerization step, several probe sets can ligate along a gene without interference.
The optimal multiplex detection scheme involves a primary PCR amplification,
followed by either LDR (two probes, same strand) or ligase chain reaction ("LCR")
(four probes, both strands) detection. This approach has been successfully applied for
simultaneous multiplex detection of 61 cystic fibrosis alleles (Grossman, P. D., et al.,
"High-Density Multiplex Detection Of Nucleic Acid Sequences: Oligonucleotide
Ligation Assay And Sequence-Coded Separation," Nucleic Acids Res., 22:4527-4534
(1994) and Eggerding, F. A., et al., "Fluorescence-Based Oligonucleotide Ligation
Assay For Analysis Of Cystic Fibrosis Transmembrane Conductance Regulator Gene
Mutations," Human Mutation, 5:153-165 (1995)), 6 hyperkalemic periodic paralysis
alleles (Feero, W. T., et al., "Hyperkalemic Periodic Paralysis: Rapid Molecular
Diagnosis And Relationship Of Genotype To Phenotype In 12 Families," Neurology,
43:668-673 (1993)), and 20 21-hydroxylase deficiency alleles (Day, D., et al.,
"Detection Of Steroid 21 Hydroxylase Alleles Using Gene-Specific PCR And A
Multiplexed Ligation Detection Reaction," Genomics, 29:152-162 (1995) and Day, D.
J., et al., "Identification Of Non-Amplifying CYP21 Genes When Using PCR-Based
Diagnosis Of 21-Hydroxylase Deficiency In Congenital Adrenal Hyperplasia (CAH)
Affected Pedigrees," Hum Mol Genet, 5(12):2039-48 (1996)).

21-hydroxylase deficiency has the highest carrier rate of any genetic
disease, with 6% of Ashkenazi Jews being carriers. Approximately 95% of mutations
causing 21-hydroxylase deficiency are the result of recombinations between an
inactive pseudogene termed CYP21P and the normally active gene termed CYP21,
which share 98% sequence homology (White, P. C., et al., "Structure Of Human
Steroid 21-Hydroxylase Genes," Proc. Natl. Acad. Sci. USA, 83:5111-5115 (1986)).
PCR/LDR was developed to rapidly determine heterozygosity or homozygosity for
any of the 10 common apparent gene conversions in CYP21. By using allele-specific
PCR, defined regions of CYP21 are amplified without amplifying the CYP21P
sequence. The presence of wild-type or pseudogene mutation is subsequently
determined by fluorescent LDR. Discriminating oligonucleotides complementary to
both CYP21 and CYP21P are included in equimolar amounts in a single reaction tube
so that a signal for either active gene, pseudogene, or both is always obtained.
PCR/LDR genotyping (of 82 samples) was able to readily type compound
heterozygotes with multiple gene conversions in a multiplexed reaction, and was in
complete agreement with direct sequencing/ASO analysis. This method was able to
distinguish insertion of a single T nucleotide into a (T)7 tract, which cannot be
achieved by allele-specific PCR alone (Day, D., et al., "Detection Of Steroid 21
Hydroxylase Alleles Using Gene-Specific PCR And A Multiplexed Ligation
Detection Reaction," Genomics, 29:152-162 (1995)). A combination of PCR/LDR
and microsatellite analysis revealed some unusual cases of PCR allele dropout (Day,
D. J., et al., "Identification Of Non-Amplifying CYP21 Genes When Using PCR-Based
Diagnosis Of 21-Hydroxylase Deficiency In Congenital Adrenal Hyperplasia
(CAH) Affected Pedigrees," Hum Mol Genet, 5(12):2039-48 (1996)). The LDR
approach is a single-tube reaction which enables multiple samples to be analyzed on a
single polyacrylamide gel.

A PCR/LDR assay has been developed to detect germline mutations,
found at high frequency (3% total), in BRCA1 and BRCA2 genes in the Jewish
population. The mutations are: BRCA1, exon 2 185delAG; BRCA1, exon 20
5382insC; BRCA2, exon 11 6174delT. These mutations are more difficult to detect
than most germline mutations, as they involve slippage in short repeat regions. A
preliminary screening of 20 samples using multiplex PCR of three exons and LDR of
six alleles in a single tube assay has successfully detected the three Ashkenazi
BRCA1 and BRCA2 mutations.

Multiplexed PCR for amplifying many regions of chromosomal DNA simultaneously. 

A coupled multiplex PCR/PCR/LDR assay was developed to identify
armed forces personnel. Several hundred SNPs in known genes with heterozygosities
> 0.4 are currently listed. Twelve of these were amplified in a single PCR reaction as
follows: Long PCR primers were designed to have gene-specific 3' ends and 5' ends
complementary to one of two sets of PCR primers. The upstream primers were
synthesized with either FAM- or TET-fluorescent labels. These 24 gene-specific
primers were pooled and used at low concentration in a 15 cycle PCR. After this, the
two sets of primers were added at higher concentrations and the PCR was continued
for an additional 25 cycles. The products were separated on an automated ABD 373A
DNA Sequencer. The use of these primers produces similar amounts of multiplexed
products without the need to carefully adjust gene-specific primer concentrations or
PCR conditions (Belgrader, P., et al., "A Multiplex PCR-Ligase Detection Reaction
Assay For Human Identity Testing," Genome Science and Technology, 1:77-87
(1996)). In a separate experiment, non-fluorescent PCR products were diluted into an
LDR reaction containing 24 fluorescently labeled allele-specific LDR probes and 12
adjacent common LDR probes, with products separated on an automated DNA
sequencer. LDR probe sets were designed in two ways: (i) allele-specific FAM- or
TET-labeled LDR probes of uniform length, or (ii) allele-specific HEX-labeled LDR
probes differing in length by two bases. A comparison of LDR profiles of several
individuals demonstrated the ability of PCR/LDR to distinguish both homozygous and
heterozygous genotypes at each locus (Id.). The use of PCR/PCR in human
identification to simultaneously amplify 26 loci has been validated (Lin, Z., et al.,
"Multiplex Genotype Determination At A Large Number Of Gene Loci," Proc Natl
Acad Sci USA, 93(6):2582-7 (1996)), as has ligase-based detection to distinguish 32
alleles, although the latter was performed in individual reactions (Nickerson, D. A., et al.,
"Identification Of Clusters Of Biallelic Polymorphic Sequence-Tagged Sites (pSTSs)
That Generate Highly Informative And Automatable Markers For Genetic Linkage
Mapping," Genomics, 12(2):377-87 (1992)). This study validates the ability to
multiplex both PCR and LDR reactions in a single tube, which is a prerequisite for
developing a high throughput method to simultaneously detect SNPs throughout the
genome.

For the PCR/PCR/LDR approach, two long PCR primers are required
for each SNP analyzed. A method which reduces the need for multiple PCR primers
would give significant savings in time and cost of a large-scale SNP analysis. The
present invention is directed to achieving this objective.

SUMMARY OF THE INVENTION 

The present invention is directed to a method of assembling genomic
maps of an organism's DNA or portions thereof. A library of an organism's DNA is
provided where the individual genomic segments or sequences are found on more
than one clone in the library. Representations of the genome are created, and nucleic
acid sequence information is generated from the representations. The sequence
information is analyzed to determine clone overlap from a representation. The clone
overlap and sequence information from different representations is combined to
assemble a genomic map of the organism.
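
The overlap determination summarized above can be pictured with a minimal sketch. It assumes, purely for illustration, that each clone has already been reduced to a set of short sequence signatures read from its representation, and that two clones are called overlapping when they share at least one signature. The clone names, the signature strings, and the function name `find_overlaps` are hypothetical and are not taken from the specification.

```python
from itertools import combinations


def find_overlaps(clone_signatures):
    """Return {(clone_a, clone_b): shared signatures} for every pair of
    clones that has at least one signature in common."""
    overlaps = {}
    for a, b in combinations(sorted(clone_signatures), 2):
        shared = clone_signatures[a] & clone_signatures[b]
        if shared:
            overlaps[(a, b)] = shared
    return overlaps


# Hypothetical signatures read from the representation of each clone.
clones = {
    "clone_A": {"GATTACAGGC", "TTGACCATGA"},
    "clone_B": {"TTGACCATGA", "CCGTAAGCTT"},
    "clone_C": {"CCGTAAGCTT", "AGGCTTAACG"},
}

overlaps = find_overlaps(clones)
for pair, shared in overlaps.items():
    print(pair, sorted(shared))
```

In this toy data, clone_A and clone_C share nothing directly but are linked through clone_B, which is the kind of chaining a map assembly step would then exploit.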

As explained in more detail infra, the representation can be created by
selecting a subpopulation of genomic segments out of a larger set of the genomic
segments in that clone. In particular, this is achieved by first subjecting an individual
clone to a first restriction endonuclease under conditions effective to cleave DNA
from the individual clone so that a degenerate overhang is created in the clone.
Non-palindromic complementary linker adapters are added to the overhangs in the
presence of ligase and the first restriction endonuclease to select or amplify particular
fragments from the first restriction endonuclease digested clone as a representation.
As a result, sufficient linker-genomic fragment products are formed to allow
determination of a DNA sequence adjacent to the overhang. Although a number of
first restriction endonucleases are suitable for use in this process, it is particularly
desirable to use the enzyme DrdI to create the representation which comprises what
are known as DrdI islands (i.e. the genomic segments which are produced when DrdI
cleaves the genomic DNA in the clones).

The procedure is amenable to automation and requires just a single
extra reaction (simultaneous cleavage/ligation) compared to straight dideoxy
sequencing. Use of from 4 to 8 additional linker adapters/primers is compatible with
microtiter plate format for delivery of reagents. A step which destroys the primers
after the PCR amplification allows for direct sequencing without purifying the PCR
products.

A method is provided for analyzing sequencing data allowing for
assignment of overlap between two or more clones. The method deconvolutes singlet,
doublet, and triplet sequencing runs allowing for interpretation of the data. For
sequencing runs which are difficult to interpret, sequencing primers containing an
additional one or two bases on the 3' end will generate a readable sequence. As an
alternative to deconvoluting doublet and triplet sequencing runs, other enzymes may
be used to create short representational fragments. Such fragments may be
differentially enriched via ultrafiltration to provide dominant signal, or, alternatively,
their differing length provides unique sequence signatures on a full length sequencing
run.

About 200,000 to 300,000 DrdI islands are predicted in the human
genome. The DrdI islands are a representation of 1/15 to 1/10 of the genome.
With an average BAC size of 100-150 kb, a total of 20,000 to 30,000 BAC clones
would cover the human genome, or 150,000 clones would provide 5-fold coverage.
Using the DrdI island approach, 4-6 sequencing runs are required per clone, for a total of
600,000 to 900,000 sequencing reactions. New automated capillary sequencing
machines (Perkin Elmer 3700 machine) can run 2,304 short (80-100 bp) sequencing
reads per day. Thus, the DrdI approach for overlapping all BAC clones providing a
5-fold coverage of the human genome would require only 39 days using 10 of the new
DNA sequencing machines.
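
The arithmetic behind these estimates can be checked with a short sketch. The genome size of 3 billion base pairs is an assumption introduced here for illustration (the paragraph does not state it); the clone count, runs per clone, read rate, and machine count are taken from the paragraph above. The function names are hypothetical.

```python
GENOME_BP = 3_000_000_000          # assumed human genome size, not stated in the text
READS_PER_MACHINE_PER_DAY = 2_304  # short reads per machine per day (from the text)
MACHINES = 10                      # number of sequencing machines (from the text)
CLONES_5X = 150_000                # clones giving 5-fold coverage (from the text)


def clones_for_one_fold(genome_bp, bac_bp):
    """Number of non-overlapping BAC clones needed to span the genome once."""
    return genome_bp // bac_bp


def sequencing_days(n_clones, runs_per_clone, reads_per_day, machines):
    """Return (total reactions, days needed at the given daily capacity)."""
    total = n_clones * runs_per_clone
    return total, total / (reads_per_day * machines)


clones_low = clones_for_one_fold(GENOME_BP, 150_000)   # larger BACs, fewer clones
clones_high = clones_for_one_fold(GENOME_BP, 100_000)  # smaller BACs, more clones

low_total, low_days = sequencing_days(CLONES_5X, 4, READS_PER_MACHINE_PER_DAY, MACHINES)
high_total, high_days = sequencing_days(CLONES_5X, 6, READS_PER_MACHINE_PER_DAY, MACHINES)

print(f"1-fold coverage: {clones_low} to {clones_high} clones")
print(f"reactions: {low_total} to {high_total}")
print(f"days on {MACHINES} machines: {low_days:.1f} to {high_days:.1f}")
```

Under these inputs the 39-day figure corresponds to the upper bound of 6 runs per clone; at 4 runs per clone the same calculation gives roughly 26 days.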

The above approach will provide a highly organized contig of the
entire genome for just under a million sequencing reactions, or about 1/70th of the
effort required by just random clone overlap. Subsequently, random sequencing will
fill in the sequence information between DrdI islands. Since the islands are anchored
in the contig, this will result in a 2- to 4-fold reduction in the amount of sequencing
necessary to obtain a complete sequence of the genome.

Single nucleotide polymorphisms or SNPs have been proposed as
valuable tools for gene mapping and discovering genes associated with common
diseases. The present invention provides a rapid method to find mapped single
nucleotide polymorphisms within genomes. A representation of the genomes of
multiple individuals is cloned into a common vector. Sequence information generated
from the representational library is analyzed to determine single nucleotide
polymorphisms.

The present invention provides a method for large scale detection of
single nucleotide polymorphisms ("SNP"s) on a DNA array. This method involves
creating a representation of a genome from a clinical sample. A plurality of
oligonucleotide probe sets are provided with each set characterized by (a) a first
oligonucleotide probe, having a target-specific portion and an addressable
array-specific portion, and (b) a second oligonucleotide probe, having a target-specific
portion and a detectable reporter label. The oligonucleotide probes in a particular set
are suitable for ligation together when hybridized adjacent to one another on a
corresponding target nucleotide sequence, but have a mismatch which interferes with
such ligation when hybridized to any other nucleotide sequence present in the
representation of the sample. A mixture is formed by blending the sample, the
plurality of oligonucleotide probe sets, and a ligase. The mixture is subjected to one
or more ligase detection reaction ("LDR") cycles comprising a denaturation
treatment, where any hybridized oligonucleotides are separated from the target
nucleotide sequences, and a hybridization treatment, where the oligonucleotide probe
sets hybridize at adjacent positions in a base-specific manner to their respective target
nucleotide sequences, if present in the sample, and ligate to one another to form a
ligation product sequence containing (a) the addressable array-specific portion, (b) the
target-specific portions connected together, and (c) the detectable reporter label. The
oligonucleotide probe sets may hybridize to nucleotide sequences in the sample other
than their respective target but do not ligate together due to a presence of one or more
mismatches and individually separate during the denaturation treatment. A solid
support with different capture oligonucleotides immobilized at particular sites is
provided where the capture oligonucleotides have nucleotide sequences
complementary to the addressable array-specific portions. After subjecting the
mixture to one or more ligase detection reaction cycles, the mixture is contacted with
the solid support under conditions effective to hybridize the addressable array-specific
portions to the capture oligonucleotides in a base-specific manner. As a result, the
addressable array-specific portions are captured on the solid support at the site with
the complementary capture oligonucleotide. Finally, the reporter labels of ligation
product sequences captured to the solid support at particular sites are detected, which
indicates the presence of single nucleotide polymorphisms.
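
The probe-set logic described above can be illustrated with a deliberately simplified sketch. It assumes that a probe pair ligates only when the two target-specific portions match the target perfectly and lie immediately adjacent to one another, which is a cruder rule than real ligase fidelity. The sequences, the address names `zip_A` and `zip_G`, and the function names are hypothetical and serve only to show how an array address ends up carrying a reporter label.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def ligates(target, first_probe, second_probe):
    """Toy rule: the probes ligate only if, joined end to end, they are
    perfectly complementary to a stretch of the target strand."""
    return (first_probe + second_probe) in reverse_complement(target)


def ldr_signals(target, probe_sets):
    """Map each array address to its reporter label for every probe set
    that ligates on the target."""
    signals = {}
    for address, first_probe, second_probe, label in probe_sets:
        if ligates(target, first_probe, second_probe):
            signals[address] = label
    return signals


# Hypothetical probe sets: (array address, target-specific portion of the
# first probe ending on the SNP base, second probe, reporter label).
probe_sets = [
    ("zip_A", "GGCTACA", "TTAGCA", "FAM"),
    ("zip_G", "GGCTACG", "TTAGCA", "FAM"),
]

# Hypothetical sample strands carrying either allele, with short flanks.
sample_a = reverse_complement("CC" + "GGCTACA" + "TTAGCA" + "AA")
sample_g = reverse_complement("CC" + "GGCTACG" + "TTAGCA" + "AA")

signals_a = ldr_signals(sample_a, probe_sets)
signals_g = ldr_signals(sample_g, probe_sets)
print(signals_a, signals_g)
```

Because each allele-specific first probe carries its own address, the position of the label on the array identifies the allele, while the second probe and its label can be shared between alleles.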

It has been estimated that 30,000 to 300,000 SNPs will be needed to
map the positions of genes which influence the major multivariate diseases in defined
populations using association methods. Since the above SNP database is connected to
a closed map of the entire genome, new genes may be rapidly discovered. Further,
the representative PCR/LDR/universal array may be used to quantify allele
imbalance. This allows for use of SNPs to discover new tumor suppressor genes,
which undergo loss of heterozygosity, or oncogenes, which undergo amplification, in
various cancers.


BRIEF DESCRIPTION OF THE DRAWINGS 



Figure 1 is a schematic drawing showing the sequencing of DrdI
islands in random plasmid or cosmid clones in accordance with the present invention.

Figure 2 is a schematic drawing of a first embodiment for sequencing
restriction enzyme generated representations.

Figure 3 is a schematic drawing of a second embodiment for
sequencing restriction enzyme generated representations.

Figure 4 is a schematic drawing for DNA sequencing directly from
PCR amplified DNA without primer interference.

Figure 5 is a schematic drawing showing another embodiment of the
DrdI island sequencing technique of the present invention.

Figure 6 is a schematic drawing showing a further alternative
embodiment of sequencing DrdI islands in random BAC clones using PCR
amplification.

Figure 7 shows the three degrees of specificity in amplifying a DrdI
representation.

Figure 8 shows the DrdI and BglI site frequencies per 40 kb in the Met
Oncogene BAC from the 7q31 chromosome. The locations of the 12 DrdI and 16
BglI sites in a 171,905 bp clone are shown pictorially and in tabular form, indicating
the type of overhang and the complement to that overhang. For this clone, per 40 kb,
the unique sites (i.e. singlets) are as follows: 1.4 of such unique DrdI sites and 3.3 of
such unique BglI sites. In this clone, per 40 kb, the sites with the 3' overhang having
the same last 2 bases (doublets, i.e. *) are as follows: 1.0 of such DrdI sites and 4.3
of such BglI sites. The number of palindromic overhangs not used (i.e. ^) is as
follows: 2 overhangs for DrdI and 0 overhangs for BglI. The number of sites with
the 3' overhang having the same last 2 bases within the BAC clone used exactly
once (singlets, i.e. @) is as follows: 2 of such DrdI sites and 5 of such BglI sites.
The number of sites with the 3' overhang having the same last 2 bases within the
BAC clone used exactly twice (doublets, i.e. #) is as follows: 4 of such DrdI sites
and 5 of such BglI sites. The number of sites with the 3' overhang having the same
last 2 bases within the BAC clone used more than twice (i.e. X) is as follows: 0 of
such DrdI sites and 3 of such BglI sites.
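
One way to read the singlet, doublet, and palindromic categories used in these figure descriptions is sketched below. The sketch assumes that each site is summarized by the last 2 bases of its overhang, that palindromic (self-complementary) overhangs are set aside, and that the categories count distinct 2-base endings used once, twice, or more than twice within a clone. That reading, the example overhang list, and the function names are assumptions made for illustration; the figures themselves may tabulate the sites differently.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def is_palindromic(overhang):
    """True if the overhang equals its own reverse complement (e.g. AT, CG)."""
    return overhang == overhang.translate(COMPLEMENT)[::-1]


def classify_overhangs(site_overhangs):
    """Count palindromic sites, and distinct non-palindromic endings that
    occur exactly once, exactly twice, or more than twice."""
    usable = [o for o in site_overhangs if not is_palindromic(o)]
    summary = {
        "palindromic_sites": len(site_overhangs) - len(usable),
        "singlets": 0,
        "doublets": 0,
        "more_than_twice": 0,
    }
    for count in Counter(usable).values():
        if count == 1:
            summary["singlets"] += 1
        elif count == 2:
            summary["doublets"] += 1
        else:
            summary["more_than_twice"] += 1
    return summary


# Hypothetical 2-base endings for the 12 sites of one clone.
example_sites = [
    "AT", "CG",              # palindromic, not used
    "AA", "AC",              # each ending used once
    "AG", "AG", "CA", "CA",  # each ending used twice
    "GA", "GA", "GG", "GG",
]

summary = classify_overhangs(example_sites)
print(summary)
```

With this hypothetical list the tally is 2 palindromic sites, 2 singlets, 4 doublets, and 0 endings used more than twice, which accounts for all 12 sites (2 + 2 + 8).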

Figure 9 shows the SapI site frequencies per 40 kb in the Met Oncogene
BAC from the 7q31 chromosome. The locations of the 25 SapI sites in a 171,905 bp
clone are shown pictorially and in tabular form, indicating the type of overhang and
the complement to that overhang. The number of sites with the 3' overhang having
the same last 2 bases within the BAC clone used exactly once (singlets, i.e. @) is 5
of such SapI sites. The number of sites with the 3' overhang having the same last 2
bases within the BAC clone used exactly twice (doublets, i.e. #) is 10 of such SapI
sites. The number of sites with the 3' overhang having the same last 2 bases within
the BAC clone used more than twice (i.e. X) is 3 of such SapI sites.

Figure 10 shows the DrdI and BglI site frequencies per 40 kb in the
HMG Oncogene BAC from the 7q31 chromosome. The locations of the 11 DrdI and
12 BglI sites in a 165,608 bp clone are shown pictorially and in tabular form,
indicating the type of overhang and the complement to that overhang. For this clone,
per 40 kb, the unique sites (i.e. singlets) are as follows: 1.2 of such unique DrdI sites
and 3.9 of such unique BglI sites. In this clone, per 40 kb, the sites with the 3'
overhang having the same last 2 bases (doublets, i.e. *) are as follows: 1.2 of such
DrdI sites and 2.0 of such BglI sites. The number of palindromic overhangs not used
(i.e. ^) is as follows: 1 overhang for DrdI and 0 overhangs for BglI. The number of
sites with the 3' overhang having the same last 2 bases within the BAC clone used
exactly once (singlets, i.e. @) is as follows: 3 of such DrdI sites and 5 of such BglI
sites. The number of sites with the 3' overhang having the same last 2 bases within
the BAC clone used exactly twice (doublets, i.e. #) is as follows: 2 of such DrdI
sites and 4 of such BglI sites. The number of sites with the 3' overhang having the
same last 2 bases within the BAC clone used more than twice (i.e. X) is as follows: 1
of such DrdI sites and 3 of such BglI sites.

Figure 11 shows the SapI site frequencies per 40 kb in the HMG
Oncogene BAC from the 7q31 chromosome with the locations of the 12 SapI sites in
a 165,608 bp clone being shown in pictorial and tabular form, indicating the type of
overhang and the complement to that overhang. The number of sites with the 3'
overhang having the same last 2 bases within the BAC clone used exactly once
(singlets, i.e. @) is 4 of such SapI sites. The number of sites with the 3' overhang
having the same last 2 bases within the BAC clone used exactly twice (doublets, i.e.
#) is 1 of such SapI sites. The number of sites with the 3' overhang having the same
last 2 bases within the BAC clone used more than twice (i.e. X) is 2 of such SapI
sites.

Figure 12 shows the DrdI and BglI site frequencies per 40 kb in the
Pendrin Oncogene BAC from the 7q31 chromosome with the locations of the 10 DrdI
and 17 BglI sites in a 97,943 bp clone being shown in pictorial and tabular form,
indicating the type of overhang and the complement to that overhang. For this clone,
per 40 kb, the unique sites are as follows: 1.3 of such unique DrdI sites and 5.0 of
such unique BglI sites. In this clone, per 40 kb, the sites with the 3' overhang having
the same last 2 bases (doublets, i.e. *) are as follows: 2.1 of such DrdI sites and 9.2
of such BglI sites. The number of palindromic overhangs not used (i.e. ^) is as
follows: 2 overhangs for DrdI and 0 overhangs for BglI. The number of sites with
the 3' overhang having the same last 2 bases within the BAC clone used exactly
once (singlets, i.e. @) is as follows: 3 of such DrdI sites and 1 of such BglI sites.
The number of sites with the 3' overhang having the same last 2 bases within the
BAC clone used exactly twice (doublets, i.e. #) is as follows: 1 of such DrdI sites
and 5 of such BglI sites. The number of sites with the 3' overhang having the same
last 2 bases within the BAC clone used more than twice (i.e. X) is as follows: 1 of
such DrdI sites and 7 of such BglI sites.

Figure 13 shows the SapI site frequencies per 40 kb in the Pendrin
gene BAC from the 7q31 chromosome with the locations of the 14 SapI sites in a
97,943 bp clone being shown in pictorial and tabular form, indicating the type of
overhang and the complement to that overhang. The number of sites with the 3'
overhang having the same last 2 bases within the BAC clone used exactly once
(singlets, i.e. @) is 7 of such SapI sites. The number of sites with the 3' overhang
having the same last 2 bases within the BAC clone used exactly twice (doublets, i.e.
#) is 2 of such SapI sites. The number of sites with the 3' overhang having the same
last 2 bases within the BAC clone used more than twice (i.e. X) is 1 of such SapI
sites.

Figure 14 shows the DrdI and BglI site frequencies per 40 kb in the
alpha2(I) collagen BAC from the 7q31 chromosome with the locations of the 11 DrdI
and 15 BglI sites in a 116,466 bp clone being shown in pictorial and tabular form, indicating
the type of overhang and the complement to that overhang. For this clone, per 40 kb,
the unique sites are as follows: 1.4 of such unique DrdI sites and 3.1 of such unique
BglI sites. In this clone, per 40 kb, the sites with the 3' overhang having the same last
2 bases (doublets, i.e. *) are as follows: 2.1 of such DrdI sites and 7.2 of such BglI
sites. The number of palindromic overhangs not used (i.e. ^) is as follows: 1
overhang for DrdI and 0 overhangs for BglI. The number of sites with the 3'
overhang having the same last 2 bases within the BAC clone used exactly once
(singlets, i.e. @) is as follows: 2 of such DrdI sites and 4 of such BglI sites. The
number of sites with the 3' overhang having the same last 2 bases within the BAC
clone used exactly twice (doublets, i.e. #) is as follows: 4 of such DrdI sites and 7 of
such BglI sites. The number of sites with the 3' overhang having the same last 2
bases within the BAC clone used more than twice (i.e. X) is as follows: 0 of such
DrdI sites and 3 of such BglI sites.

Figure 15 shows the SapI site frequencies per 40 kb in the alpha2(I)
collagen BAC from the 7q31 chromosome with the locations of the 18 SapI sites in a
116,466 bp clone being shown in pictorial and tabular form, indicating the 18 SapI site
locations, the type of overhang, and the complement to that overhang. The number
of sites with the 3' overhang having the same last 2 bases within the BAC clone used
exactly once (singlets, i.e. @) is 4 of such SapI sites. The number of sites with the
3' overhang having the same last 2 bases within the BAC clone used exactly twice
(doublets, i.e. #) is 3 of such SapI sites. The number of sites with the 3' overhang
having the same last 2 bases within the BAC clone used more than twice (i.e. X) is 2
of such SapI sites.

Figure 16 is a schematic drawing showing the sequencing of BglI
islands in random BAC clones in accordance with the present invention.

Figure 16A is a schematic drawing showing the sequencing of BglI
islands in random BAC clones using PCR amplification.

Figure 17 is a schematic drawing showing the sequencing of SapI
islands in random BAC clones in accordance with the present invention.

Figure 17A shows the probabilities of two or more singlets or doublets
of DrdI, SapI, or BglI sites in BAC clones containing 2 to 36 sites.

Figure 18 shows the alignment of BAC clone sequences, which are
concordant and discordant, from DrdI sites.

Figure 19 shows DrdI/MseI fragments in approximately 2 MB of
human DNA. The average fragment size is about 125 bp, with most fragments being
under 600 bp.

Figure 20 shows DrdI/MspI/TaqI fragments in approximately 2 MB of
human DNA. The average fragment size is about 1,000 bp, with most fragments
being over 600 bp.

Figure 21 shows how 4 unique singlet DrdI sequences are determined
from 2 overlapping doublet BAC clone sequences.

Figure 22 shows how 3 unique singlet DrdI sequences are determined
from overlapping doublet and triplet BAC clone sequences.

Figure 23 shows the BglI, DrdI, and SapI sites in the pBeloBAC11
cloning vector.

Figure 24 shows the BglI, DrdI, and SapI sites in the pUC19 cloning
vector.

Figure 25 is a schematic drawing showing the sequencing of BamHI
islands in random BAC clones.

Figure 26 shows the EcoRI, HindIII, and BamHI site frequencies for
the Met Oncogene in a sequenced BAC clone from the 7q31 chromosome. There are
19 BamHI sites, 49 EcoRI sites, and 64 HindIII sites within the 171,905 bp clone as
shown. The number of BamHI sites that are the same where the 2 bases next to the
site within the BAC clone are used exactly once (a singlet, i.e. @) is 6. The number
of BamHI sites that are the same where the 2 bases next to the site within the BAC
clone are used exactly twice (a doublet, i.e. #) is 2. The number of BamHI sites that
are the same where the 2 bases next to the site within the BAC clone are used more
than once is 2.

Figure 27 shows the AvrII, NheI, and SpeI site frequencies for the Met
Oncogene in a sequenced BAC clone from the 7q31 chromosome. There are 25
AvrII sites, 22 NheI sites, and 21 SpeI sites within the 171,905 bp clone shown.
The number of AvrII sites that are the same where the 2 bases next to the site within
the BAC clone are used exactly once (a singlet, i.e. @) is 5. The number of AvrII
sites that are the same where the 2 bases next to the site within the BAC clone are
used exactly twice (a doublet, i.e. #) is 2. The number of AvrII sites that are the
same where the 2 bases next to the site within the BAC clone are used more than once
is 3. The number of NheI sites that are the same where the 2 bases next to the site
within the BAC clone are used exactly once (a singlet, i.e. @) is 3. The number of
NheI sites that are the same where the 2 bases next to the site within the BAC clone
are used exactly twice (a doublet, i.e. #) is 3. The number of NheI sites that are the
same where the 2 bases next to the site within the BAC clone are used more than once
is 3. The number of SpeI sites that are the same where the 2 bases next to the site
within the BAC clone are used exactly once (a singlet, i.e. @) is 3. The number of
AvrII sites that are the same where the 2 bases next to the site within the BAC clone
are used exactly twice (a doublet, i.e. #) is 3. The number of AvrII sites that are the
same where the 2 bases next to the site within the BAC clone are used more than once
is 3.

Figure 28 is a schematic drawing showing the sequencing of BsiHKAI
islands in random BAC clones.

Figure 29 shows the AccI and BsiHKAI site frequencies for the Met
Oncogene in a sequenced BAC clone from the 7q31 chromosome. 71 AccI sites and
127 BsiHKAI sites within the 171,905 bp clone are shown. The number of AccI sites that
are the same where the 2 bases next to the site within the BAC clone are used exactly
once (a singlet, i.e. @) is 4. The number of AccI sites that are the same where the 2
bases next to the site within the BAC clone are used exactly twice (a doublet, i.e. #)
is 2. The number of AccI sites that are the same where the 2 bases next to the site
within the BAC clone are used more than once is 0. The number of BsiHKAI sites
that are the same where the 2 bases next to the site within the BAC clone are used
exactly once (a singlet, i.e. @) is 6. The number of BsiHKAI sites that are the same
where the 2 bases next to the site within the BAC clone are used exactly twice (a
doublet, i.e. #) is 3. The number of BsiHKAI sites that are the same where the 2
bases next to the site within the BAC clone are used more than twice is 0.

Figure 30 is a schematic drawing showing the sequencing of SanDI
islands in random BAC clones.

Figure 31 shows the SanDI and SexAI site frequencies for the Met
Oncogene in a sequenced BAC clone from the 7q31 chromosome. There are 13
SanDI sites and 15 SexAI sites within the 171,905 bp clone. The number of SanDI sites
that are the same where the 2 bases next to the site within the BAC clone are used
exactly once (a singlet, i.e. @) is 3. The number of SanDI sites that are the same
where the 2 bases next to the site within the BAC clone are used exactly twice (a
doublet, i.e. #) is 5. The number of SanDI sites that are the same where the 2 bases
next to the site within the BAC clone are used more than once is 0. The number of
SexAI sites that are the same where the 2 bases next to the site within the BAC clone
are used exactly once (a singlet, i.e. @) is 8. The number of SexAI sites that are the
same where the 2 bases next to the site within the BAC clone are used exactly twice
(a doublet, i.e. #) is 2. The number of SexAI sites that are the same where the 2 bases
next to the site within the BAC clone are used more than twice is 1.

Figure 32 shows the AccI and BsiHKAI sites in the pBeloBAC11
cloning vector. There are 6 AccI sites and 8 BsiHKAI sites.

Figure 33 shows the AvrII, BamHI, NheI, and SpeI sites in the
pBeloBAC11 cloning vector.

Figure 34 shows the SanDI and SexAI sites in the pBeloBAC11
cloning vector.

Figure 35 shows the DrdI, BglI, SapI, TaqI, and MspI sites in a
sequenced BAC cloning vector from the 7q31 chromosome. There are 12 DrdI sites,
16 BglI sites, 25 SapI sites, 63 TaqI sites, and 86 MspI sites in the 171,905 base pairs.

Figure 36 shows the three degrees of specificity in amplifying a Bgll 

representation. 

Figure 37 shows Scheme 1 for sequencing for Drdl and Bgll 
20 representations of individual BAG clones. 

Figure 38 shows overlapping DrdI islands in four hypothetical BAC
clones using AA overhangs.

Figure 39 shows overlapping DrdI islands in four hypothetical BAC
clones using AC overhangs.

Figure 40 shows overlapping DrdI islands in four hypothetical BAC
clones using AG overhangs.

Figure 41 shows overlapping DrdI islands in four hypothetical BAC
clones using CA overhangs.

Figure 42 shows overlapping DrdI islands in four hypothetical BAC
clones using GA overhangs.

Figure 43 shows overlapping DrdI islands in four hypothetical BAC
clones using GG overhangs.

Figure 44 shows overlapping DrdI islands in four hypothetical BAC
clones using AA, AC, AG, CA, GA, and GG overhangs.

Figure 45 shows the alignment of the four hypothetical BAC clones
based upon the unique and overlapping DrdI islands depicted in Figures 38 to 44.
Figure 46 shows the sizes of representational fragments generated by
DrdI, TaqI, and MspI digestion in overlapping BACs from 7q31. When such
fragments are amplified using linker ligation/PCR amplification, they will contain
approximately 25 additional bases on each side. Sizes of fragments were determined
from 3 separate contigs on 7q31 known as contig 1941 (BACs RG253B13,
RG013N12, and RG300C03), contig T002144 (BACs RG022J17, RG067E13,
RG011J21, RG022C01, and RG043K06), and contig T002149 (RG343P13,
RG205G13, O68P20, and H_133K23). Overlaps between BACs in contig 1941 are
indicated by the following symbols: RG253B13/RG013N12 = * and
RG013N12/RG300C03 = †. Overlaps between BACs in contig T002144 are indicated
by the following symbols: RG022J17/RG067E13 = *, RG067E13/RG011J21 = †,
RG011J21/RG022C01 = ‡, and RG022C01/RG043K06 = **. Overlaps between
BACs in contig T002149 are indicated by the following symbols:
RG343P13/RG205G13 = *, RG205G13/O68P20 = †, and O68P20/H_133K23 = ‡.

Figure 47 shows the sizes of representational fragments generated by
DrdI and MseI digestion in overlapping BACs from 7q31. When such fragments are
amplified using linker ligation/PCR amplification, they will contain approximately 25
additional bases on each side. Sizes of fragments were determined from 3 separate
contigs on 7q31 known as contig 1941 (BACs RG253B13, RG013N12, and
RG300C03), contig T002144 (BACs RG022J17, RG067E13, RG011J21, RG022C01,
and RG043K06), and contig T002149 (RG343P13, RG205G13, O68P20, and
H_133K23). Overlaps between BACs in contig 1941 are indicated by the following
symbols: RG253B13/RG013N12 = * and RG013N12/RG300C03 = †. Overlaps
between BACs in contig T002144 are indicated by the following symbols:
RG022J17/RG067E13 = *, RG067E13/RG011J21 = †, RG011J21/RG022C01 = ‡,
and RG022C01/RG043K06 = **. Overlaps between BACs in contig T002149 are
indicated by the following symbols: RG343P13/RG205G13 = *,
RG205G13/O68P20 = †, and O68P20/H_133K23 = ‡.

Figure 48 shows the DrdI, TaqI, and MspI sites in 4 sequenced BAC
clones from a 7q31 chromosome as well as the location and identities of the AA,
AC, AG, CA, GA, and GG overhangs and their overhangs.

Figure 49 is a schematic drawing showing the PCR amplification of a
DrdI representation for shotgun cloning and generating mapped SNPs.

Figure 49A is a schematic drawing of the PCR amplification of a DrdI
representation for shotgun cloning and generating mapped SNPs.

Figure 50 is a schematic drawing showing the PCR amplification of a
DrdI representation for high-throughput SNP detection.

Figure 50A is an alternative schematic drawing showing the PCR
amplification of a DrdI representation for high-throughput SNP detection.

Figures 51A-B show the quantitative detection of the G12V mutation of
the K-ras gene using two LDR probes in the presence of 10 micrograms of salmon
sperm DNA. Figure 51A is a graph showing that the amount of LDR product formed is a
linear function of K-ras mutant DNA template, even at very low amounts of template.
Figure 51B is a log-log graph of the amount of LDR product formed for various amounts
of K-ras mutant DNA in a 20 µl LDR reaction. The amount of LDR product formed
with 2.5 pM (50 amol) to 3 nM (60 fmol) of mutant K-ras template was determined in
duplicate using fluorescent probes on an ABI 373 DNA sequencer.
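
As a quick check of the template amounts quoted for Figure 51B, the following minimal sketch converts a molar concentration and a reaction volume into moles. The function and constant names are illustrative only and are not part of the described procedure.

```python
def moles_in_reaction(concentration_molar, volume_liters):
    """Moles of template = molar concentration x reaction volume."""
    return concentration_molar * volume_liters


REACTION_VOLUME_L = 20e-6  # 20 microliter LDR reaction, as stated for Figure 51B

# Lowest and highest template concentrations quoted in the text.
low_amol = moles_in_reaction(2.5e-12, REACTION_VOLUME_L) / 1e-18   # attomoles
high_fmol = moles_in_reaction(3e-9, REACTION_VOLUME_L) / 1e-15     # femtomoles

print(f"{low_amol:.0f} amol, {high_fmol:.0f} fmol")
```

Under these assumptions the calculation agrees with the 50 amol and 60 fmol figures given in the parentheses above.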

Figures 52A-B show a scheme for PCR/LDR detection of mutations in
codons 12 and 13 of K-ras using an addressable array. Figure 52A shows a
schematic representation of chromosomal DNA containing the K-ras gene. Exons are
shaded and the positions of codons 12 and 13 are shown. Exon-specific primers were
used to selectively amplify K-ras DNA flanking codons 12 and 13. Probes were
designed for LDR detection of seven possible mutations in these two codons.
Discriminating LDR probes contained a complement to an address sequence on the 5'
end and the discriminating base on the 3' end. Common LDR probes were
phosphorylated on the 5' end and contained a fluorescent label on the 3' end.
Figure 52B shows that the presence and type of mutation are determined by hybridizing
the contents of an LDR reaction to an addressable DNA array. The capture
oligonucleotides on the array have sequences which are designed to be sufficiently
different, so that only probes containing the correct complement to a given capture
oligonucleotide remain bound at that address. In the LDR reaction, only a portion of
the hybrid probe is ligated to its adjacent common fluorescently labeled probe (in the
presence of the correct target). Thus, for every hybridization, an identical quantity of
addressable array-specific portion competes for hybridization to each address. This
feature allows for simultaneous identification and quantification of LDR signal.

Figure 53 shows the array hybridization of K-ras LDR products.
Arrays were hybridized for 1 hour at 65°C in a hybridization oven with nine
individual LDR reactions (17 µL) diluted to 55 µL with 1.4X hybridization buffer.
Following hybridization, arrays were washed for 10 minutes at room temperature in
300 mM bicine pH 8.0, 10 mM MgCl2, 0.1% SDS. The arrays were analyzed on an
Olympus AX70 epifluorescence microscope equipped with a Princeton Instruments
TE/CCD-512 TKBM1 camera. The images were collected using a 2 second exposure
time. All nine arrays displayed signals corresponding to the correct mutant and/or
wild-type for each tumor or cell line sample. The small spots seen in some of the
panels, i.e. near the center of the panel containing the G13D mutant, are not incorrect
hybridizations, but noise due to small bubbles in the polymer.

Figures 54A-B show the quantification of minority fluorescently-labeled
oligonucleotide probe captured by a universal addressable array using two
different detection instruments. Hybridizations were carried out using 55 µl
hybridization buffer containing 4,500 fmol fluorescently-labeled common probes, 9
x 500 fmol of each unlabeled, addressable array-specific portion-containing
discriminating probe, and 1 to 30 fmol cZip13 oligonucleotide. Figure 54A shows
the quantification of the amount of captured cZip13 oligonucleotide using a
Molecular Dynamics 595 FluorImager. Figure 54B shows the quantification of the
amount of captured cZip13 oligonucleotide using an Olympus AX70 epifluorescence
microscope equipped with a Princeton Instruments TE/CCD-512 TKBM1 camera.

Figure 55 shows how an allelic imbalance can be used to distinguish
gene amplification from loss of heterozygosity (i.e. LOH) in tumor samples which
contain stromal contamination.

Figure 56 shows the PCR/LDR quantification of different ratios of K-ras
G12V mutant to wild-type DNA. LDR reactions were carried out in a 20 µl
reaction containing 2 pmol each of the discriminating and wild type ("wt") probe,
4 pmol of the common probe and 1 pmol total of various ratios of PCR product (pure
wt and pure G12V mutant) from cell lines (HT29 and SW620). LDR reactions were
thermally cycled for 5 cycles of 30 sec at 94°C and 4 min. at 65°C, and quenched on
ice. 3 µl of the LDR reaction product was mixed with 1 µl of loading buffer (83%
formamide, 8.3 mM EDTA, and 0.17% Blue Dextran) and 0.5 µl TAMRA 350
molecular weight marker, denatured at 94°C for 2 minutes, chilled rapidly on ice prior
to loading on an 8 M urea-10% polyacrylamide gel, and electrophoresed on an ABI
373 DNA sequencer at 1400 volts. Fluorescent ligation products were analyzed and
quantified using the ABI GeneScan 672 software (Perkin-Elmer Biosystems, Foster
City, CA). The amount of product obtained was calculated using the peak area and
from the calibration curve (1 fmol = 600 peak area units). The normalized ratio was
obtained by multiplying or dividing the absolute ratio by the 1:1 absolute ratio.
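
The quantification described for Figure 56 can be sketched as follows. This is a minimal illustration, assuming that the calibration is linear (1 fmol = 600 peak area units, as stated) and that normalization means dividing an absolute mutant/wild-type ratio by the absolute ratio measured for the 1:1 mixture. The function names and the peak areas in the example are hypothetical and are not data from the figure.

```python
PEAK_AREA_UNITS_PER_FMOL = 600.0  # calibration stated in the text


def peak_area_to_fmol(peak_area):
    """Convert a GeneScan peak area into femtomoles of LDR product."""
    return peak_area / PEAK_AREA_UNITS_PER_FMOL


def absolute_ratio(mutant_area, wild_type_area):
    """Ratio of mutant to wild-type LDR product, in fmol/fmol."""
    return peak_area_to_fmol(mutant_area) / peak_area_to_fmol(wild_type_area)


def normalized_ratio(abs_ratio, abs_ratio_one_to_one):
    """Assumed normalization: divide by the ratio observed for the 1:1 mix."""
    return abs_ratio / abs_ratio_one_to_one


# Hypothetical peak areas, for illustration only.
reference = absolute_ratio(900.0, 600.0)    # 1:1 mixture
sample = absolute_ratio(1800.0, 600.0)      # unknown mixture
sample_normalized = normalized_ratio(sample, reference)
```

Normalizing against the 1:1 mixture compensates for the two probes ligating with different efficiencies, so a normalized ratio of 1 corresponds to equal amounts of the two templates.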

Figures 57A-B are schematic drawings showing PCR/LDR procedures
using addressable DNA arrays where there are 2 alternative labeling schemes for
capture on the array.

Figure 58 is a schematic diagram showing a labeling scheme for
PCR/SNUPE with addressable array capture.

Figure 59 is a diagram showing a labeling scheme for PCR/LDR with
gene array capture.

Figure 60 is a schematic diagram showing a labeling scheme for
LDR/PCR with addressable array capture.

Figure 61 is a diagram showing a labeling scheme for LDR/PCR with
lambda exonuclease digestion and addressable array capture.

Figures 62A-B are schematic drawings showing 2 alternative dual
label strategies to quantify LDR signal using addressable DNA arrays.

Figure 63 shows the detection of gene amplification in tumor samples
which contain stromal contamination using addressable array-specific portions on the
discriminating oligonucleotide probe.

Figure 64 shows the detection of gene amplification in tumor samples
which contain stromal contamination using addressable array-specific portions on the
common oligonucleotide probe.

Figure 65 shows the detection of loss of heterozygosity (i.e. LOH) in tumor
samples which contain stromal contamination using addressable array-specific
portions on the discriminating oligonucleotide probes.

Figure 66 shows the detection of loss of heterozygosity (i.e. LOH) in tumor
samples which contain stromal contamination using addressable array-specific
portions on the common oligonucleotide probes.

Figure 67 shows the calculations for the detection procedure shown in
Figure 63.

Figure 68 shows the calculations for the detection procedure shown in
Figure 64.

Figure 69 shows the calculations for the detection procedure shown in
Figure 65.

Figure 70 shows the calculations for the detection procedure shown in
Figure 66.

Figure 71 shows the fidelity of T4 DNA ligase on synthetic
target/linker. T4 DNA ligase assays were performed with linkers containing 2 base 3'
overhangs (GG, AA, AG, and GA) and synthetic targets containing 2 base 3'
complementary or mismatched overhangs (CC, TT, TC, and CT). Products represent
both top and bottom strand ligation products. Synthetic targets were designed such
that the common strand (top strand) provided a 39 nucleotide product (common
product), while the specific strand (bottom strand) provided a 48 (CC, TT), 52 (CT),
or 56 (TC) nucleotide product. Only the correct complement product is observed,
while there were no misligations. Since TT- and CC-targets result in the same length
products, TT-targets are not present in GG-linker assays and CC-targets are not
present in AA-linker assays. For AG- and GA-linker assays, all four targets (TC-,
CT-, CC-, and TT-) are present. Synthetic complementary target was present at 5 nM,
and each linker/adapter was present at either 50 nM (=10x concentration), or 500 nM
(=100x concentration).

Figure 72 shows DrdI representations of human genomic DNA. The
DrdI representation of human genomic DNA was generated by "regular PCR" and
"touchdown PCR" using 3 and 4 base selection PCR primers. The six lanes following
the 100 bp ladder lane were the PCR amplification of DrdI AG-overhang fragments
of human genome by regular PCR and touchdown PCR using AGG, AGA, AGAT,
and AGAG selection primers, respectively. The last six lanes were the PCR
amplification of DrdI GA-overhang fragments of human genome by regular PCR and
touchdown PCR using GAG, GAT, GAGT, and GATG selection primers,
respectively.

Figure 73 shows the sensitivity of a PCR/LDR reaction. Human
genomic DNA was subjected to PCR amplification using region specific primers,
followed by LDR detection using LDR probes specific to the amplified regions.
Aliquots of 3 µl of the reaction products were mixed with 3 µl of loading buffer (83%
formamide, 8.3 mM EDTA, and 0.17% Blue Dextran) and 0.5 µl Rox-1000, or
TAMRA 350 molecular weight marker, denatured at 94°C for 2 min., chilled rapidly
on ice prior to loading on an 8 M urea-10% polyacrylamide gel, and electrophoresed
on an ABI 373 DNA sequencer at 1400 volts. Fluorescent ligation products were
analyzed and quantified using the ABI GeneScan software. The first six lanes were
the results of an LDR assay of PCR amplified human genomic DNA using probes
which amplify fragments which should be present in AGA DrdI representations;
without salmon sperm DNA, and 500, 1,500, 4,500, 13,500 fold dilutions in 10 µg
salmon sperm DNA, and 10 µg salmon sperm DNA alone, respectively. The last six
lanes were the results of an LDR assay of PCR amplified human genomic DNA using
probes which amplify fragments which should be present in AGG DrdI
representations; without salmon sperm DNA, and 500, 1,500, 4,500, 13,500 fold
dilutions in 10 µg salmon sperm DNA, and 10 µg salmon sperm DNA alone,
respectively.

Figure 74 shows LDR detection of AG-overhang representations of
the human genome. DrdI representations were generated by the "regular PCR" and
the "touchdown PCR" using common probe MTCG228 and 3 and 4 base selection
PCR primers AGAP60, AGGP61, AGATP62, and AGAGP63. The presence of
specific fragments in the representation was detected by LDR using probes specific
to the amplified regions (Table 16). In the REF lane, used as the standard, were
LDR results of PCR products generated from probes designed for each of the targeted
regions in the human genome. The labels on the left refer to the four bases present at
the DrdI site and the number in parentheses represents the predicted length of the
DrdI-MspI/TaqI fragment. The four lanes following the REF lane were the LDR
results of detecting representation generated by regular PCR and touchdown PCR
using AGG reach in primer AGGP61, respectively. The four lanes under AGA
representation were the LDR results of detecting representation generated by regular
PCR and touchdown PCR with AGA reach in primer AGAP60, respectively. The
four lanes under AGAT representation were the LDR results of detecting
representation generated by regular PCR and touchdown PCR with AGAT reach in
primer AGATP62, respectively. The four lanes under AGAG representation were the
LDR results of detecting representation generated by regular PCR and touchdown
PCR with AGAG reach in primer AGAGP63, respectively.

Figure 75 shows LDR detection of CA-overhang representations of
the human genome. DrdI representations were generated by the "regular PCR" and
the "touchdown PCR" using common probe MTCG228 and 3 and 4 base selection
PCR primers CATP58, CAGP59, CATGP64, and CAGTP65. The presence of specific
fragments in the representation was detected by LDR using probes specific to the
amplified regions (Table 17). In the REF lane, used as the standard, were LDR results
of PCR products generated from probes designed for each of the targeted regions in
the human genome. The labels on the left refer to the four bases present at the DrdI
site and the number in parentheses represents the predicted length of the
DrdI-MspI/TaqI fragment. The four lanes following the REF lane were the LDR results of
detecting representations generated by "regular PCR" with CAGP59, CATP58,
CAGTP65, and CATGP64 reach in probes, respectively. The last four lanes were the
LDR results of detecting representations generated by "touchdown PCR" with
CAGP59, CATP58, CAGTP65, and CATGP64 reach in probes, respectively.

DETAILED DESCRIPTION OF THE INVENTION 

The present invention is directed to a method of assembling genomic
maps of an organism's DNA or portions thereof. A library of an organism's DNA is
provided where the individual genomic segments or sequences are found on more
than one clone in the library. Representations of the genome are created, and nucleic
acid sequence information is generated from the representations. The sequence
information is analyzed to determine clone overlap from a representation. The clone
overlap and sequence information from different representations is combined to
assemble a genomic map of the organism.

Summary of DrdI island approach to accelerate alignment of clones.

The DrdI island approach obtains a representation of the sequence in a
genome which may be used to complete the map of the genome, to find mapped
SNPs, and to evaluate genome differences and their association with diseases.

The first step of the procedure is to form a library of genomic DNA in
cosmid, bacteriophage P1, or bacterial artificial chromosome ("BAC") clones. Each
clone of the library is cut with a restriction enzyme into a plurality of fragments which
have degenerate ends. Unique linkers are ligated to the degenerate ends. Internal
sequence information in the clones may be obtained by sequencing off the linkers.
This creates 1 kb "islands" of sequence surrounding the restriction sites which are
within that clone. In essence, a "representation" of the genome in the form of
"islands" is created, but the islands are attached to random clones and hence the clone
overlap can be determined.

Depending on the particular restriction site used, an average of 5-8
different sets of sequencing runs are performed on the random clones (and up to 16 if
needed), creating the representations of the genome. The sequence information from
one set (e.g., a sequencing primer ending with 3' AA) may be used to align clones
based on an analysis of overlaps between singlet, doublet, and even triplet reads. In
addition, a given clone contains interpretable sequence information from at least two
sets, and often from all 5-8 sets. Thus, the information from different sets on the
same clone may also be used to align clones.
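
The use of shared island reads to infer clone overlap can be illustrated with a minimal sketch. It assumes that each clone has already been reduced to a set of island reads for one sequencing set and that overlapping clones yield identical reads; real reads would need tolerance for sequencing errors. The clone names, read sequences, and function name are hypothetical.

```python
from itertools import combinations


def shared_islands(clone_islands):
    """Map each pair of clones to the island reads they have in common.

    clone_islands: dict of clone name -> set of island read sequences.
    Pairs with no read in common are omitted from the result.
    """
    overlaps = {}
    for clone_a, clone_b in combinations(sorted(clone_islands), 2):
        common = clone_islands[clone_a] & clone_islands[clone_b]
        if common:
            overlaps[(clone_a, clone_b)] = common
    return overlaps


# Hypothetical reads from one set (e.g., the sequencing primer ending in 3' AA).
example = {
    "clone1": {"ACGTTGCA", "GGATCCTA"},
    "clone2": {"GGATCCTA", "TTGACCGA"},
    "clone3": {"TTGACCGA"},
    "clone4": {"CCCCAAAA"},
}
overlaps = shared_islands(example)
```

In this toy input, clone1 and clone2 share one island and clone2 and clone3 share another, which suggests the order clone1, clone2, clone3; clone4 shares nothing in this set and would have to be placed using a different set.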

Once an overlapping map of the human genome is created, it becomes
a powerful tool for completing the entire genomic sequence as well as identifying
mapped SNPs. This procedure permits 100,000 SNPs to be identified by a shotgun
method which immediately gives their map position. Further, these SNPs are
amenable for use in a high throughput detection scheme which uses a universal DNA
array.

I. Preparation of Genomic DNA

In order to carry out the mapping procedure of the present invention,
the genomic DNA to be mapped needs to be divided into a genomic library
comprising a plurality of random clones. The genomic library can be formed and
inserted into cosmid clones, bacteriophage P1 vectors, or bacterial artificial
chromosome clones ("BAC") as described in Chapters 2, 3, and 4 of Birren, et al.,
Genomic Analysis: A Laboratory Manual, Vol. 3 (Cold Spring Harbor Laboratory
Press 1997), which is hereby incorporated by reference.

When producing cosmid clones, a genomic DNA library may be
constructed by subjecting a sample of genomic DNA to proteinase K digestion
followed by partial enzymatic digestion with MboI to form DNA fragments of random
and varying size of 30-50 kb. Cosmid vectors with single cos sites can be digested
with BamHI to linearize the vector followed by dephosphorylation to prevent
religation. Cosmid vectors with dual cos sites can be digested with XbaI to separate
the two cos sites and then dephosphorylated to prevent religation. The vector and
genomic DNA are ligated and packaged into lambda phage heads using in vitro
packaging extract prepared from bacteriophage lambda. The resulting phage particles
are used to infect an E. coli host strain, and circularization of cosmid DNA takes place
in the host cell.

In forming bacteriophage P1 vector libraries, genomic DNA is
subjected to partial digestion with a restriction enzyme like Sau3AI followed by size
fractionation to produce 70 to 100 kb DNA fragments with Sau3AI 5' overhangs at
each end. A bacteriophage P1 cloning vector can be treated sequentially with the ScaI
and BamHI restriction enzymes to form short and long vector arms and
dephosphorylated with BAP or CIP to prevent religation. The pac site can then be
cleaved by incubation with an extract prepared by induction of a bacteriophage
lysogen that produces appropriate bacteriophage P1 pac site cleavage proteins (i.e.
Stage I reaction). After the pac site is cleaved, the DNA is incubated with a second
extract prepared by induction of a bacteriophage lysogen that synthesizes
bacteriophage P1 heads and tails but not pac site cleavage proteins (i.e. Stage II
reaction). The genomic DNA and vector DNA are then ligated together followed by
treatment with Stage I and, then, Stage II extract of pac site cleavage proteins.
Unidirectional packaging into the phage head is initiated from the cleaved pac end.
After the phage head is filled with DNA, the filled head is joined with a phage tail to
form mature bacteriophage particles. The P1 DNA is then incorporated into a
bacterial host cell constitutively expressing the Cre recombinase. The phage DNA is
cyclized at loxP sites, and the resulting closed circular DNA is amplified.

In producing BAC libraries, genomic DNA in agarose is subjected to
partial digestion with a restriction enzyme followed by size separation. BAC vectors
are digested with a restriction enzyme and then dephosphorylated to prevent
religation. Suitable restriction enzymes for digestion of the BAC vectors include
HindIII, BamHI, EcoRI, and SphI. After conducting test ligations to verify that clones
with low background will be produced, the genomic DNA and BAC DNA are ligated
together. The ligated genomic and BAC DNA is then transformed into host cells by
electroporation. The resulting clones are plated.

II. DrdI Island Approach

A Single Restriction/Ligation Reaction is Used to Obtain Internal Sequences of
Clones at DrdI Sites.

Once the individual clones are produced from genomic DNA and
separated from one another, as described above, the individual clones are treated in
accordance with the DrdI approach of the present invention.

Figure 1 is a schematic drawing showing the sequencing of DrdI
islands in random plasmid or cosmid clones in accordance with the present invention.
The random plasmid or cosmid clones produced as described above are amplified.
Nucleic acid amplification may be accomplished using the polymerase chain reaction
process. The polymerase chain reaction process is the preferred amplification
procedure and is fully described in H. Erlich, et al., "Recent Advances in the
Polymerase Chain Reaction," Science 252: 1643-50 (1991); M. Innis, et al., PCR
Protocols: A Guide to Methods and Applications, Academic Press: New York
(1990); and R. Saiki, et al., "Primer-directed Enzymatic Amplification of DNA with
a Thermostable DNA Polymerase," Science 239: 487-91 (1988), which are hereby
incorporated by reference. Long range PCR procedures are described in Cheng, et al.,
"Long PCR," Nature, 369(6482):684-5 (1994) and Cheng, et al., "Effective
Amplification of Long Targets From Cloned Inserts and Human Genomic DNA,"
Proc Natl Acad Sci USA, 91(12): 5695-9 (1994), which are hereby incorporated by
reference.

In carrying out a polymerase chain reaction process, the target nucleic
acid, when present in the form of a double stranded DNA molecule, is denatured to
separate the strands. This is achieved by heating to a temperature of 85-105°C.
Polymerase chain reaction primers are then added and allowed to hybridize to the
strands, typically at a temperature of 50-85°C. A thermostable polymerase (e.g.,
Thermus aquaticus polymerase) is also added, and the temperature is then adjusted to
50-85°C to extend the primer along the length of the nucleic acid to which the primer
is hybridized. After the extension phase of the polymerase chain reaction, the
resulting double stranded molecule is heated to a temperature of 85-105°C to denature
the molecule and to separate the strands. These hybridization, extension, and
denaturation steps may be repeated a number of times to amplify the target to an
appropriate level.

The amplified clones are then incubated with a DrdI restriction
enzyme, a T4 ligase, and a linker at 15°C to 42°C, preferably 37°C, for 15 minutes to
4 hours, preferably 1 hour. As shown in Figure 1, the DrdI restriction enzyme cuts
both strands of the clone where indicated by the arrows and the T4 ligase couples a
double stranded linker to the right hand portion of the cut clone to form a double
stranded ligation product, as shown in Figure 1. In the embodiment depicted, the
linker has an AA overhang, but, as discussed infra, DrdI will cut any 6 bases between
a GAC triplet and GTC triplet, leaving a 3' double base (i.e. NN) overhang.
Therefore, the DrdI island technique of the present invention utilizes a different linker
for each of the non-palindromic, 3' double base overhangs to be identified.
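
The recognition rule stated above (any 6 bases between a GAC triplet and a GTC triplet, with a 3' two-base overhang) can be sketched as a simple sequence scan. This is an illustrative sketch only: it assumes, consistent with the GACNNNN^NNGTC notation used below, that the overhang on one end consists of the middle two of the six degenerate bases and that the other end carries its reverse complement. The function names and the example sequence are hypothetical.

```python
import re

# Lookahead pattern so that overlapping recognition sites are all reported.
DRDI_SITE = re.compile(r"(?=GAC([ACGT]{6})GTC)")
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA string containing only A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]


def find_drdi_sites(sequence):
    """Return (position, overhang, complementary overhang) for each DrdI site.

    position is the 0-based index of the G of GAC on the given strand.
    """
    sites = []
    for match in DRDI_SITE.finditer(sequence.upper()):
        degenerate = match.group(1)      # the six N bases
        overhang = degenerate[2:4]       # middle two bases (assumed overhang)
        sites.append((match.start(), overhang, reverse_complement(overhang)))
    return sites


example = "TTGACAAAACCGTCTT"  # hypothetical sequence with one DrdI site
sites = find_drdi_sites(example)
```

Each site therefore contributes one overhang and its reverse complement, which is why a separate linker is required for each non-palindromic dinucleotide.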

After the different linkers are ligated to the fragments of DNA
produced by DrdI digestion to form a phosphorylated site containing, in the case of
Figure 1, a 3' AA overhang, the T4 ligase and the restriction enzyme (i.e. DrdI) are
inactivated by heating at 65°C to 98°C, preferably 95°C, for 2 minutes to 20 minutes,
preferably 5 minutes. As shown in Figure 1, a sequencing primer is contacted with
the ligation product after it is denatured to separate its two strands. For the linker
depicted, the sequencing primer has a 3' AA overhang and nucleotides 5' to the
overhang which make the primer suitable for hybridization to one strand of the
ligation product. Sequencing primers adapted to hybridize to the ligation products
formed from the other linkers are similarly provided. With such sequencing primers,
a dideoxy sequencing reaction can be carried out to identify the different DrdI
cleavage sites. Dideoxy sequencing is described in Chadwick, et al., "Heterozygote
and Mutation Detection by Direct Automated Fluorescent DNA Sequencing Using a
Mutant Taq DNA Polymerase," Biotechniques, 20(4):676-83 (1996) and Voss, et al.,
"Automated Cycle Sequencing with Taquenase: Protocols for Internal Labeling, Dye
Primer and 'Doublex' Simultaneous Sequencing," Biotechniques, 23(2):312-8 (1997),
which are hereby incorporated by reference. In situations where the results of
dideoxy sequencing with primers having a 2 base 3' end (i.e. NN) are too difficult to
interpret due to priming three or more fragments during the sequencing reaction,
additional selectivity can be achieved by performing 4 separate dideoxy sequencing
reactions for each linker. For example, with respect to the linker 3' AA overhang,
sequencing primers having 3' ends of AAA, AAC, AAG, and AAT can be utilized to
obtain sequences for DrdI cleavages filled with the AA-containing linker. This
technique is amenable to automation. In cases where there is insufficient DNA
template to conduct dideoxy sequencing, this sequencing step can be preceded by a
PCR amplification procedure. Suitable PCR amplification conditions are described
above.
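
The choice between a two-base primer end and the four three-base primer ends can be written out as a small helper. This is only a sketch of the rule stated above (switch to four reactions when three or more fragments are primed); the function name and its parameters are illustrative assumptions.

```python
def sequencing_primer_ends(overhang, fragments_primed):
    """Return the 3' primer ends to use for one linker overhang.

    overhang: the two-base 3' end of the linker (e.g., "AA").
    fragments_primed: number of fragments primed in the clone by that end.
    """
    if fragments_primed >= 3:
        # Too many overlapping reads: add one selective base per reaction.
        return [overhang + base for base in "ACGT"]
    return [overhang]


ends_for_crowded_clone = sequencing_primer_ends("AA", 3)
ends_for_simple_clone = sequencing_primer_ends("AA", 2)
```

For the AA linker this yields the ends AAA, AAC, AAG, and AAT named in the text when three or more fragments are primed, and the single end AA otherwise.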

The results of the above-described sequencing procedure indicate the
number of times a particular linker sequence is present in an individual clone. If a
particular linker sequence appears only one time in a given clone, it is referred to as a
unique or singlet sequence, while the presence of a particular linker sequence two
times is referred to as a doublet, three times is referred to as a triplet, etc. The fragments
with the different 2 base overhangs (e.g., AA, AC, AG, CA, GA, and GG) constitute
representations, and the representations for different clones are then examined to
determine if there is any commonality (i.e. the clones overlap). Based on this
analysis, the different clones are assembled into a genomic map.
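
The singlet, doublet, and triplet terminology can be illustrated by counting how often each linker overhang occurs in a clone. The sketch below assumes the overhangs found in one clone are already available as a list; the function name, the label for counts above three, and the example list are hypothetical.

```python
from collections import Counter

LABELS = {1: "singlet", 2: "doublet", 3: "triplet"}


def classify_overhangs(overhangs):
    """Label each overhang by how many times it occurs in one clone."""
    counts = Counter(overhangs)
    return {
        overhang: LABELS.get(count, f"{count}-fold")
        for overhang, count in counts.items()
    }


# Hypothetical overhangs observed at the DrdI sites of a single clone.
clone_overhangs = ["AA", "AG", "AG", "CA", "GG", "GG", "GG"]
classification = classify_overhangs(clone_overhangs)
```

Singlet overhangs give a single clean read, whereas doublets and triplets give superimposed reads from the same primer, which is why their counts matter when comparing clones.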

The enzyme DrdI (GACNNNN^NNGTC) leaves a 3' NN overhang in
the middle of 6 bases of degenerate sequence. The 16 NN sites which may be created
fall into three groups: 4 self-complementary dinucleotides (Group I), 6
non-complementary dinucleotides (Group II), and the other 6 non-complementary
dinucleotides (Group III), as follows.



Group I     Group II    Group III

CG          AG          CT
GC          AC          GT
AT          CA          TG
TA          GA          TC
            AA          TT
            GG          CC



Group I has complementary overhangs. Thus, a given linker would
ligate to both sides of the cut site, so sequencing reactions would provide double reads
on the same lane and would not be worth pursuing. Further, the complementary
linkers can ligate to each other, forming primer dimers. Therefore, sites which
generate CG, GC, AT, or TA ends will be ignored.

Groups II and III are ideal. Linkers with unique sequences (for a
subsequent sequencing run) ending in AG, AC, CA, GA, AA, and GG can be used in
a first ligation reaction. Linkers ending in the other six dinucleotides (i.e. CT, GT,
TG, TC, TT, and CC) can be used in a second ligation reaction.
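
The grouping of the 16 possible overhangs can be checked with a short sketch. Group I is computed as the dinucleotides that equal their own reverse complement, Group II is taken from the list given above, and Group III is derived as the reverse complements of Group II. The variable and function names are illustrative.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA string containing only A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]


ALL_OVERHANGS = ["".join(pair) for pair in product("ACGT", repeat=2)]

# Group I: self-complementary (palindromic) overhangs.
GROUP_I = [d for d in ALL_OVERHANGS if reverse_complement(d) == d]

# Group II: the six overhangs listed in the text for the first ligation.
GROUP_II = ["AG", "AC", "CA", "GA", "AA", "GG"]

# Group III: the overhangs found on the opposite side of each Group II site.
GROUP_III = [reverse_complement(d) for d in GROUP_II]
```

Because every Group III overhang is the reverse complement of a Group II overhang, sequencing from the Group II linkers alone already samples each non-palindromic DrdI site once.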

To reduce the number of sequencing runs needed, sequences should be
obtained from the overhang which requires linker adapters whose 3' two bases end in
AA, AC, AG, CA, GA, and GG. This avoids use of both linkers and sequencing
primers which contain or end in a "T" base. Such linkers or primers are more
susceptible to misligations or mispriming since T-G mismatches give higher rates of
misligation or polymerase extension than any of the other mismatches.

The advantage of using DrdI is that it leaves a 2 base 3' overhang on a
split palindrome. Thus, the product of a PCR reaction may be immediately used in a
DrdI restriction/ligation reaction, without requiring time consuming purification of
the PCR fragment. Polymerase will not extend the 3' overhang ends generated by DrdI.

DrdI sites are eliminated by ligation of the linkers, but are recreated
and cut again if two PCR fragments are ligated together. The DrdI linker is
phosphorylated so both strands ligate. Since the end is non-complementary, it cannot
ligate to itself. Thus, all free DrdI ends will contain linkers.

As noted above, the linkers of Group II or Group III can be used together.
As shown in Figures 2 and 3, there are 2 schemes for separately carrying out each of
the DrdI island sequencing procedures for each group.

As shown in Figure 2, one scheme involves using a single tube or well:
(1) to PCR amplify or partially purify DNA from individual clones from the cosmid,
PAC, or BAC libraries; (2) to incubate with DrdI, T4 ligase, and the 6 divergent
linkers with nonpalindromic 3' double base overhangs; and, optionally, (3) to PCR
amplify to generate sufficient DNA template for dideoxy sequencing. At this point,
the material to be sequenced is aliquoted into multiple (e.g., 6) tubes or wells with
each tube or well being used to carry out one of the 6 separate sequencing reactions
for each of the DrdI cleavage sites filled by the 6 linkers of Group II or Group III. If
sequencing primers with an additional base are needed to overcome sequencing reads
which are difficult to interpret (as discussed above), these primers can be added to the
tube or well used to carry out the sequencing of the cleavage site for their respective
linker.

Figure 2 provides a scheme for sequencing representations of BAC
clones. Two approaches may be considered for preparing DNA. One rapid approach
is to pick individual colonies into lysis buffer and lyse cells under conditions which
fragment chromosomal DNA but leave BAC DNA intact. Chromosomal DNA is
digested by the ATP dependent DNase from Epicentre, which leaves CCC and OC
BAC DNA intact. After heat treatment to inactivate the DNase, restriction digestion,
ligation of linker adapters, and PCR amplification are all performed in a single tube.
The products are then aliquoted and sequencing is performed using specific primers to
the adapters. This first approach has the advantage of obviating the need to grow and
store 300,000 BAC clones.

An alternative approach is to pick the colonies into 1.2 ml growth
media and make a replica into fresh media for storage before pelleting and preparing
crude BAC DNA from a given liquid culture in a manner similar to that described
above. This second approach has the advantage of producing more BAC DNA, such
that loss of an island from PCR dropout is less likely. Further, this approach keeps a
biological record of all the BACs, which may become useful in the future for
techniques such as exon trapping, transfection into cells, or methods as yet
undeveloped.

As shown in Figure 3, the second scheme involves using a single tube
or well to PCR amplify or partially purify DNA from individual clones from the
cosmid, PAC, or BAC libraries. The PCR product can then be aliquoted into multiple
(e.g., 6) tubes or wells: (1) to incubate with DrdI, T4 ligase, and the 6 divergent
linkers with nonpalindromic 3' double base overhangs; (2), optionally, to PCR
amplify to generate sufficient DNA template for dideoxy sequencing; and (3) to carry
out one of the 6 separate sequencing reactions for each of the DrdI cleavage sites
filled by the 6 linkers of Group II or Group III. As to step (3), if sequencing primers
with an additional base are needed to overcome sequencing reads which are difficult
to interpret (as discussed above), these primers can be added to the tube or well used
to carry out the sequencing of the cleavage site for their respective linker.

As shown in Figure 4, DNA sequencing can be carried out directly
from PCR-amplified DNA without primer interference. The PCR primers from the
original PCR reaction may be removed by using riboU-containing primers and
destroying them with either base or (using dU) with UNG. This is achieved by
incorporating ribonucleotides directly into PCR primers. Colonies are then picked
into microwell PCR plates. The primers containing ribose, on average every fourth
nucleotide, are added. The preferred version would use r(U) in place of dT, which
simplifies synthesis of primers. After PCR amplification, in the presence of dNTPs
and Taq polymerase, 0.1 N NaOH is added and the PCR product is heated at 95°C for
5 minutes to destroy unused primers. The PCR product is then diluted to 1/10th of the
volume in 2 wells and forward and reverse sequencing primers are added to run
fluorescent dideoxy sequencing reactions.

Another approach to sequence DNA directly from PCR-amplified
DNA uses one phosphorylated primer, lambda exonuclease to render that strand and
the primer single stranded, and shrimp alkaline phosphatase to remove dNTPs. This
is commercially available in kit form from Amersham Pharmacia Biotech,
Piscataway, NJ. A more recent approach to sequence DNA directly from
PCR-amplified DNA uses ultrafiltration in a 96 well format to simply remove primers
and dNTPs physically, and is commercially available from Millipore, Danvers, MA.

Figure 5 shows an alternative embodiment of the DrdI island
sequencing procedure of the present invention. In this embodiment, individual BAC
clones are cut with the restriction enzymes DrdI and MspI in the presence of linkers
and T4 ligase. This is largely the same procedure as that described with reference to
Figure 1 except that the MspI restriction enzyme is utilized to reduce the length of the
fragment to a size suitable for PCR amplification. In Figure 5, the subtleties of the
linker-adapter ligations and bubble PCR amplification to select only the DrdI-MspI
fragments are detailed. As in Figure 1, the linker for the DrdI site is phosphorylated
and contains a 3' two base overhang (e.g., a 3' AA overhang as in Figure 5). A
separate linker is used for the MspI site which replaces the portion of the BAC DNA
to the right of the MspI site in Figure 5. The MspI linker is not phosphorylated and
contains a bubble (i.e. a region where the nucleotides of this double stranded DNA
molecule are not complementary) to prevent amplification of unwanted MspI-MspI
fragments. The T4 ligase binds the DrdI and MspI linkers to their respective sites on
the BAC DNA fragments with biochemical selection assuring that most sites contain
linkers.

After the different linkers are ligated to the fragments of DNA
produced by DrdI digestion to form a phosphorylated site containing, in the case of
Figure 5, a 3' AA overhang, the T4 ligase and the restriction enzymes (i.e. DrdI and
MspI) are inactivated by heating at 65°C to 98°C, preferably 95°C, for 2 minutes to
20 minutes, preferably 5 minutes. As shown in Figure 5, the ligation product is
amplified using a PCR procedure under the conditions described above. For the
linker depicted, one amplification primer has a 3' AA overhang and nucleotides 5' to
the overhang which makes the primer suitable for hybridization to the bottom strand
of the ligation product for polymerization in the 3' to 5' direction. The other
sequencing primer, for the linker depicted in Figure 5, has a 5' CG overhang which
makes this primer suitable for hybridization to the top strand of the ligation product
for polymerization in the 5' to 3' direction. Amplification primers adapted to
hybridize to the ligation products formed from the other linkers are similarly
provided. As described with reference to Figure 4, PCR amplification is carried out
using primers with ribose U instead of dT, adding dNTPs and Taq polymerase, adding
NaOH, and heating at 85°C to 98°C, preferably 95°C, for 2 minutes to 20 minutes,
preferably 5 minutes, to inactivate any unused primer.

After amplification is completed and the amplification product is
neutralized and diluted, dideoxy sequencing can be conducted in substantially the
same manner as discussed above with reference to Figure 1. If necessary, a separate
dideoxy sequencing procedure can be conducted using a sequencing primer which
anneals to the MspI site linker. This is useful to generate additional sequence
information associated with the DrdI island.

Figure 6 shows a variation of the scheme for amplifying DrdI islands
for sequencing directly from small quantities of BAC DNA. Individual BAC clones
are cut with the restriction enzymes DrdI, MspI, and TaqI in the presence of linkers
and T4 ligase. This is largely the same procedure as described in Figure 5 except that
the MspI and TaqI restriction enzymes are used to reduce the length of the fragment to
a size suitable for PCR amplification. As in Figure 5, the linker for the DrdI site is
phosphorylated and contains a 3' two base overhang (e.g., a 3' AA overhang as in
Figure 6). A separate linker is used for the MspI or TaqI site which replaces the
portion of the BAC DNA to the right of the MspI or TaqI site in Figure 6. This
MspI/TaqI linker is phosphorylated, contains a 3' blocking group on the 3' end of the
top strand, and contains a bubble to prevent amplification of unwanted MspI-MspI,
TaqI-MspI, or TaqI-TaqI fragments. While the linker can ligate to itself in the
phosphorylated state, these linker dimers will not amplify. Phosphorylation of the
linker and use of a blocking group eliminates the potential artifactual amplification of
unwanted MspI-MspI, TaqI-MspI, or TaqI-TaqI fragments. T4 ligase attaches the
DrdI and MspI/TaqI linkers to their respective sites on the BAC DNA fragments with
biochemical selection assuring that most sites contain linkers. The ligation product is
PCR amplified using primers complementary to the linkers. After amplification is
completed, dideoxy sequencing can be performed as described above.

Figure 7 describes the three levels of specificity in using the DrdI
island approach.

Specificity of the DrdI Linker Ligations and Subsequent Sequencing Reactions.

The specificity of T4 and thermostable DNA ligases is compared below in
Table 3.



Table 3. Fidelity of T4 and different thermostable DNA ligases.

Nicked duplex substrates:

  C-G match at 3'-end                   top -GTC p-    bottom -CAG-
  T-G mismatch at 3'-end                top -GTT p-    bottom -CAG-
  T-G mismatch at penultimate 3'-end    top -GTC p-    bottom -CGG-

Ligase        Conc.   Initial rate of   Initial rate of    Initial rate of      Ligation      Ligation
              (nM)    C-G match         T-G mismatch at    T-G mismatch at      fidelity 1^a  fidelity 2^b
                      (fmol/)           3'-end (fmol/)     penultimate 3'-end
                                                           (fmol/)
T4            0.5     1.4 x 10^2        2.8                7.1                  5.0 x 10^1    1.9 x 10^1
T. th-wt      1.25    5.5 x 10^1        6.5 x 10^-2        2.9 x 10^-1          8.4 x 10^2    1.9 x 10^2
T. th-K294R   12.5    1.5 x 10^2        2.9 x 10^-2        3.8 x 10^-1          5.3 x 10^3    4.0 x 10^2
T. sp. AK16D  12.5    1.3 x 10^2        2.5 x 10^-2        1.2 x 10^-1          5.2 x 10^3    1.1 x 10^3
Aquifex sp.   12.5    9.9 x 10^1        2.9 x 10^-2        2.6 x 10^-1          3.5 x 10^3    3.8 x 10^2

The reaction mixture consisted of 20 mM Tris-HCl, pH 7.6, 10 mM MgCl2, 100 mM KCl, 10 mM DTT, 1 mM NAD+, 20
µg/ml BSA, and 12.5 nM nicked DNA duplex substrates. T4 DNA ligase fidelity was assayed at 37°C. Thermostable ligase
fidelity was assayed at 65°C. Fluorescently labeled products were separated on an ABI 373 DNA sequencer and quantified
using the ABI GeneScan 672 software.

a: Ligation fidelity 1 = Initial Rate of C-G match / Initial Rate of T-G mismatch at 3'-end.

b: Ligation fidelity 2 = Initial Rate of C-G match / Initial Rate of T-G mismatch at penultimate 3'-end.
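
The two fidelity ratios defined in footnotes a and b can be checked with a few lines of arithmetic. The following is only an illustrative sketch using the T4 ligase rates as read from Table 3; the function name `ligation_fidelity` is hypothetical and is not part of the original disclosure. Small differences from the tabulated fidelities are to be expected, since the tabulated rates are rounded.

```python
def ligation_fidelity(match_rate, mismatch_rate):
    """Ratio of the initial rate on the matched substrate to the
    initial rate on a mismatched substrate (footnotes a and b)."""
    return match_rate / mismatch_rate

# T4 ligase initial rates as read from Table 3 (same units in each column).
t4_match = 1.4e2            # C-G match at 3'-end
t4_mismatch_3prime = 2.8    # T-G mismatch at 3'-end
t4_mismatch_penult = 7.1    # T-G mismatch at penultimate 3'-end

fidelity_1 = ligation_fidelity(t4_match, t4_mismatch_3prime)
fidelity_2 = ligation_fidelity(t4_match, t4_mismatch_penult)

print(f"T4 fidelity 1 = {fidelity_1:.1f}")
print(f"T4 fidelity 2 = {fidelity_2:.1f}")
```

The same function applies to any other row of the table, provided the three rates are taken from the same ligase.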



Both the thermostable and the T4 ligase show the highest degree of mismatch ligation
for G:T or T:G mismatches. Consequently, by studying the fidelity of these reactions,
the limits of mismatch discrimination may be determined.

While the thermostable ligases exhibit 10 to 100-fold greater fidelity
than T4 ligase, the latter enzyme is far more efficient in ligating 2 base overhangs.
Therefore, ligation, in accordance with the present invention, should be performed
using T4 ligase. There are three degrees of specificity: (i) ligation of the top strand
requires perfect complementarity at the 3' side of the junction; (ii) ligation of the
bottom strand requires perfect complementarity at the 3' side of the junction; and
(iii) extension of polymerase off the sequencing primer is most efficient if the 3' base
is perfectly matched. All three of these reactions demonstrate 50-fold or greater
discrimination if the match or mismatch is at the 3' end and 20-fold or greater
discrimination if the match or mismatch is at the penultimate position to the 3' end.

How to interpret the results: 

A computer simulation was performed on 4 known sequenced BAC
clones from chromosome 7q31. The distribution of DrdI sites in these clones and
their overhangs is shown in Figures 8-11. There are 38 non-palindromic DrdI sites in
about 550 kb of DNA, or an average of 1 non-palindromic DrdI site per 15 kb.

The average 30-40 kb clone should be cut about three times with DrdI
to generate non-palindromic ends. Again, palindromic ends are discounted, so the
average clone needs to be a little bigger to accommodate the extra silent cuts and still
get an average of 3 non-palindromic cuts. It should be noted, however, that as long as
there are 2 or more DrdI sites which are singlets (i.e. present once in the clone) or
doublets (present twice in the clone) in all of the clones to be aligned, such alignment
can be successfully achieved. In the best case scenario, each of the overhangs is
unique (i.e. a singlet), so 6 unique sequencing runs are generated, and these are
connected in matched pairs (i.e. the sequence generated from the primers ending in
AA is connected to the sequence generated from primers ending in TT), so about 3 x
1 kb "DrdI islands" of sequence are somewhere within the 30-40 kb flanked by the two
500-800 base-pair anchors.
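
The singlet/doublet bookkeeping described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the recognition site is taken to be GACNNNN^NNGTC (the commonly catalogued DrdI site, which is not restated in this passage), an overhang and its reverse complement are counted as one class, and all function names are hypothetical rather than taken from the original computer simulation.

```python
import re
from collections import Counter

# Assumed DrdI recognition site: GACNNNN^NNGTC (six unspecified bases).
# The lookahead allows overlapping sites to be reported.
DRDI_SITE = re.compile(r"(?=(GAC[ACGT]{6}GTC))")
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(_COMPLEMENT)[::-1]

def drdi_overhang_counts(sequence):
    """Count non-palindromic DrdI sites per overhang class.

    The 2 base 3' overhang is taken as the middle two of the six
    unspecified bases. An overhang and its reverse complement
    (e.g. AA and TT) are pooled under the alphabetically first name.
    """
    counts = Counter()
    for match in DRDI_SITE.finditer(sequence.upper()):
        site = match.group(1)
        overhang = site[5:7]
        partner = reverse_complement(overhang)
        if overhang == partner:
            continue  # palindromic overhang (AT, TA, GC, CG): discounted
        counts[min(overhang, partner)] += 1
    return counts

def readable_classes(counts):
    """Overhang classes present once (singlet) or twice (doublet)."""
    return sorted(name for name, n in counts.items() if n in (1, 2))

def can_be_aligned(counts):
    """Criterion from the text: singlets plus doublets total 2 or more."""
    return len(readable_classes(counts)) >= 2

# Toy clone with three DrdI sites: AA/TT, AG/CT, and a palindromic AT.
clone = "TTGACAAAAAAGTCCCCCGACCCAGCCGTCGGGACAAATAAGTC"
counts = drdi_overhang_counts(clone)
print(dict(counts), can_be_aligned(counts))
```

Pooling an overhang with its complement mirrors the phrase "either in the overhang or its complement" used for the fingerprints; a real analysis would also record site positions.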

Now if two random 30-40 kb clones overlap, the chances are excellent
that they will either run into each other on the ends, or, alternatively, 1 to 3 of the
internal sequences will be identical. There will be a few cases where two clones
overlap and different internal 1 kb sequences are obtained, because there is a small
probability of having a DrdI polymorphism. However, these will simply add to the
density of sequence which may run into or overlap with existing markers.

As shown in Figure 8, use of the DrdI approach in mapping the Met
Oncogene in a BAC clone from the 7q31 chromosome identifies 12 DrdI sites within
the 171,905 bp shown. The overhangs and complements shown in the positions set
forth in Figure 8 are based on the known sequence in GenBank. More particularly,
there are TC and CA singlets and GG, GT, CT, and TT doublets (either in the
overhang or its complement) for the DrdI islands. Since the sum of singlets and
doublets is greater than or equal to 2, this fingerprint for the Met Oncogene in a BAC
clone can be used to determine the positional relationship of this clone with respect to
other clones in the library as described infra.

Figure 10 shows how the DrdI approach is used in mapping the HMG
gene in a BAC clone from the 7q31 chromosome. Within the 165,608 bp shown,
there are 11 DrdI sites with the known sequences used to identify the overhangs and
complements in the positions set forth in Figure 10. More particularly, there are TT,
GT, and GA singlets and CT and GG doublets (either in the overhang or its
complement) for the DrdI islands. Since the sum of singlets and doublets is greater
than or equal to 2, this fingerprint for the HMG gene in a BAC clone can be used
to determine the positional relationship of this clone with respect to other clones in the
library, as described infra.

Figure 12 shows the use of the DrdI approach in mapping the Pendrin
gene in a BAC clone from the 7q31 chromosome to identify 10 DrdI sites within the
97,943 bp shown. The overhangs and complements shown in the positions set forth in
Figure 12 are based on the known sequence in GenBank. Specifically, there are 3
singlets (i.e. CC, TT, and GA), 1 doublet (i.e. AA), and 1 multiplet (i.e. CT) (either in
the overhang or its complement) for the DrdI islands. Since the sum of singlets and
doublets is greater than or equal to 2, this fingerprint for the Pendrin gene in a BAC
clone can be used to determine the positional relationship of this clone with respect to
other clones in the library, as described infra.

Figure 14 shows how the DrdI approach is used in mapping the
alpha2(I) gene in a BAC clone from the 7q31 chromosome. There are 11 DrdI sites
within the 116,466 bp with the known sequences used to identify the overhangs and
complements shown in the positions set forth in Figure 14. There are 2 singlets (i.e.
AG and GG) and 4 doublets (i.e. AA, TG, GT, and TC) (either in the overhang or its
complement) for the DrdI islands. Since the sum of singlets and doublets is 2 or
greater, this fingerprint for the alpha2(I) gene can be used to determine the positional
relationship of this clone with respect to other clones in the library, as described infra.

Two special cases need to be considered:

In the first case, the clone has no internal DrdI sites with
non-palindromic ends. This will occur on occasion. Again, computer analysis on the four
fully sequenced BAC clones (about 550 kb of DNA) showed two areas which would
leave gaps in the cosmid contigs. This does not preclude overlapping such clones to
larger superstructures (i.e. BACs and YACs).

The solution to this problem is to use a second enzyme with a
comparable frequency in the human genome. By slightly modifying the procedure, 16
linker/primer sets may be used on split palindrome enzymes which generate a 3 base
3' overhang. Since the overhang is an odd number of bases, it is not necessary to
exclude the palindromic two base sequences AT, TA, GC, and CG. To reduce the
number of ligations from 64 (all the different possible 3 base overhangs) to 16, the
linkers and primers are degenerate at the third position, i.e. end with NTC or NGC.
As noted above, since there are 3 levels of specificity in the ligation and sequencing
step, the third base degeneracy will not interfere with the fidelity of the reaction. With
3 base overhangs, multiplet sequences which are difficult to interpret may be teased
apart by either: (i) using linkers and primers which lack the 3rd base degeneracy, or
(ii) using sequencing primers which extend an extra base on the 3' end of the primer.
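
The counting argument in the preceding paragraph can be verified by enumeration. The sketch below is an illustration only (the helper names are hypothetical): it lists the self-complementary 2 base overhangs, confirms that no 3 base overhang can be its own reverse complement, and builds the 16 degenerate linker ends, assuming the degenerate position is written first as in "NTC" and "NGC".

```python
from itertools import product

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(_COMPLEMENT)[::-1]

def all_overhangs(length):
    """Every possible overhang of the given length."""
    return ["".join(bases) for bases in product("ACGT", repeat=length)]

def palindromic(overhangs):
    """Overhangs equal to their own reverse complement (self-ligating)."""
    return [o for o in overhangs if o == reverse_complement(o)]

two_base_palindromes = palindromic(all_overhangs(2))
three_base_palindromes = palindromic(all_overhangs(3))

# 16 linker ends, degenerate (N) at one position, specific at the other two.
degenerate_ends = ["N" + "".join(pair) for pair in product("ACGT", repeat=2)]

print(two_base_palindromes)          # the four excluded 2 base overhangs
print(len(all_overhangs(2)) - len(two_base_palindromes))  # usable 2 base ends
print(len(three_base_palindromes))   # none for an odd-length overhang
print(len(all_overhangs(3)), len(degenerate_ends))
```

An odd-length overhang cannot be self-complementary because its middle base would have to pair with itself, which is why none of the 64 three base overhangs needs to be excluded.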

Of the 4 commercially available split palindrome enzymes which
generate a 3 base 3' overhang, BglI (GCCNNNN^NGGC) and DraIII
(CACNNN^GTG) are present at low enough frequencies to be compatible with DrdI.
There are 60 BglI sites in about 550 kb of the four sequenced BAC clones, or an
average of 1 BglI site per 9 kb. The frequencies of the other split palindrome enzymes
in human DNA are: DraIII (1 per 8 kb), AlwNI (1 per 4 kb), and PflMI (1 per 3 kb).

Although there are some type IIs enzymes which will allow the same 2
base 3' overhang ligation, they are not split palindromes and hence simultaneous
cutting and ligation will only provide the sequence from one side. This can be an
advantage for some enzymes, as described for SapI below.

Figures 8, 10, 12, and 14 show how the enzyme BglI can generate a 3
base 3' overhang which can be used in accordance with the present invention.

Figure 8 shows the use of the BglI approach in mapping the Met
Oncogene in a BAC clone from the 7q31 chromosome. There are 16 BglI sites within
the 171,905 bp shown with known sequences used to identify the overhangs and
complements. More particularly, there are 5 singlets (i.e. the CT, TT, TG, TG, and
GO overhangs) and 5 doublets (i.e. the TA, GO, CG, GA, and AG overhangs) (either
in the overhang or its complement) for the BglI islands. Since the sum of the singlets
and doublets is greater than or equal to 2, this fingerprint for the Met Oncogene in a
BAC clone can be used to determine the positional relationship of this clone with
respect to other clones in the library, as described infra.

Figure 10 shows the use of the BglI approach in mapping the HMG
gene in a BAC clone from the 7q31 chromosome. Within the 165,608 bp shown,
there are 12 BglI sites with known sequences used to identify the overhangs and
complements in the positions set forth in Figure 9. Specifically, there are 5 singlets
(i.e. the GT, AA, AG, GG, and GG overhangs) and 4 doublets (i.e. the AG, TG, TT,
and GA overhangs) (either in the overhang or its complement) for the BglI islands.
Since the sum of the singlets and doublets is greater than or equal to 2, this fingerprint
for the HMG gene in a BAC clone can be used to determine the positional
relationship of this clone with respect to other clones in the library, as described infra.

Figure 12 shows the use of the BglI approach in mapping the Pendrin
gene in a BAC clone from the 7q31 chromosome to identify the 17 BglI sites within
the 97,943 bp shown. The overhangs and complements shown in the positions set
forth in Figure 10 are based on known sequences. Specifically, there is 1 singlet (i.e.
the TG overhang) and 5 doublets (i.e. TA, GT, GG, TT, and AA overhangs) (either in
the overhang or its complement) for the BglI islands. Since the sum of the singlets
and doublets is greater than or equal to 2, this fingerprint for the Pendrin gene in a
BAC clone can be used to determine the positional relationship of this clone with
respect to other clones in the library, as described infra.

Figure 14 shows how the BglI approach is used in mapping
the alpha2(I) gene in a BAC clone from the 7q31 chromosome. There are 15 BglI
sites within the 116,466 bp with known sequences used to identify the overhangs and
complements shown in the positions set forth in Figure 11. There are 4 singlets (i.e.
the AA, TT, GC, and GG overhangs) and 7 doublets (i.e. the TA, GA, CG, TC, AA,
CC, and AC overhangs) (either in the overhang or its complement) for the BglI
islands. Since the sum of the singlets and doublets is greater than or equal to 2, this
fingerprint for the alpha2(I) gene can be used to determine the positional relationship
of this clone with respect to other clones in the library, as described infra.

Similarly, Figures 9, 11, 13, and 15 show how the enzyme SapI can
also generate 3 base 3' overhangs in accordance with the present invention. Figure 16
is a schematic drawing showing the sequencing of BglI islands in random BAC
clones in accordance with the present invention. This is largely the same as the
embodiment of Figure 7, except that a different enzyme is used. In this embodiment,
individual BAC clones are cut with the restriction enzymes BglI and MspI in the
presence of linkers and T4 ligase. As in Figure 7, the linker for the BglI site is
phosphorylated and contains a 3' three base overhang (e.g., a 3' NAG overhang). A
separate linker is used for the MspI site which replaces the portion of the BAC clone
DNA to the right of the MspI site in Figure 7. The MspI linker is not phosphorylated
and contains a bubble (i.e. a region where the nucleotides of this double stranded
DNA molecule are not complementary) to prevent amplification of unwanted
MspI-MspI fragments. The T4 ligase binds the BglI and MspI linkers to their respective
sites on the BAC clone DNA with biochemical selection assuring that most sites
contain linkers.

After the different linkers are ligated to the fragments of DNA
produced by BglI digestion to form a phosphorylated site containing, in the case of
Figure 16, a 3' NAG overhang, the T4 ligase and the restriction enzymes (i.e. BglI and
MspI) are inactivated at 65°C to 98°C, preferably 95°C, for 2 minutes to 20 minutes,
preferably 5 minutes. As shown in Figure 16, the ligation product is amplified using a
PCR procedure under the conditions described above. For the linker depicted, one
amplification primer has a 3' AG overhang and nucleotides 5' to the overhang which
makes the primer suitable for hybridization to the bottom strand of the ligation product
for polymerization in the 3' to 5' direction. Amplification primers adapted to
hybridize to the ligation products formed from the other linkers are similarly
provided. As described with reference to Figure 6, PCR amplification is carried out
using primers with ribose U instead of dT, adding dNTPs and Taq polymerase, adding
NaOH, and heating at 85°C to 98°C, preferably 95°C, for 2 minutes to 20 minutes,
preferably 5 minutes, to inactivate any unused primer.

After amplification is completed and the amplification product is
neutralized and diluted, dideoxy sequencing can be conducted in substantially the
same manner as discussed above with reference to Figure 1. If necessary, a separate
dideoxy sequencing procedure can be conducted using a sequencing primer which
anneals to the MspI site linker. This is useful to generate additional sequence
information associated with the BglI island.

Another departure from the schematic of Figure 5 is that, in the scheme
of Figure 16, a separate linker ligation procedure is carried out with the portion of the
BAC clone on the left side of Figure 16. The primer utilized in this procedure is
phosphorylated and ends with a 3' NTA overlap sequence.

Figure 17 is a schematic drawing showing the sequencing of SapI
islands in random BAC clones in accordance with the present invention. This is
largely the same as the embodiment of Figure 5, except that a different enzyme is
used. In this embodiment, individual BAC clones are cut with the restriction enzymes
SapI and MspI in the presence of linkers and T4 ligase. As in Figure 5, the linker for
the SapI site is phosphorylated and contains a 3' three base overhang (e.g., a 3' NUG
overhang). A separate linker is used for the MspI site which replaces the portion of
the BAC DNA to the right of the MspI site as in Figure 5. The MspI linker is not
phosphorylated and contains a bubble (i.e. a region where the nucleotides of this
double stranded DNA molecule are not complementary) to prevent amplification of
unwanted MspI-MspI fragments. The T4 ligase binds the SapI and MspI linkers to
their respective sites on the BAC DNA with biochemical selection assuring that most
sites contain linkers.

After the different linkers are ligated to the fragments of DNA
produced by SapI digestion to form a phosphorylated site containing, in the case of
Figure 5, a 3' NUG overhang, the T4 ligase and the restriction enzymes (i.e. SapI and
MspI) are inactivated at 65°C to 98°C, preferably 95°C, for 2 minutes to 20 minutes,
preferably 5 minutes. As shown in Figure 15, the ligation product is amplified using a
PCR procedure under the conditions described above. For the linker depicted, one
amplification primer has a 3' NTG overhang and nucleotides 5' to the overhang
which makes the primer suitable for hybridization to the bottom strand of the ligation
product for polymerization in the 3' to 5' direction. The other sequencing primer, for
the linker depicted in Figure 17, has a 5' CA overhang which makes this primer
suitable for hybridization to the top strand of the ligation product for polymerization
in the 5' to 3' direction. Amplification primers adapted to hybridize to the ligation
products formed from the other linkers are similarly provided. As described with
reference to Figure 4, PCR amplification is carried out using primers with ribose U
instead of dT, adding dNTPs and Taq polymerase, adding NaOH, and heating at 85°C
to 98°C, preferably 95°C, for 2 minutes to 20 minutes, preferably 5 minutes, to
inactivate any unused primer.

After amplification is completed and the amplification product is
neutralized and diluted, dideoxy sequencing can be conducted in substantially the
same manner as discussed above with reference to Figure 1. If necessary, a separate
dideoxy sequencing procedure can be conducted using a sequencing primer which
anneals to the MspI site linker. This is useful to generate additional sequence
information associated with the SapI island.

In a second case, the clone has two DrdI sites with the same 3'
overhangs. Thus, the sequencing reads have two bases at each position. For a clone
with three DrdI sites, the probability of NOT having an overlap is
6/6 x 5/6 x 4/6 = 20/36 = 0.55. So the probability of having an overlap is
1 - 0.55 = 0.45, or about every other clone. At first glance, this may appear to cause
a problem, but, in fact, it is very useful. These reads need not be discarded: on
average every 4th base will be the same in both reads and, thus, clearly
distinguishable. Thus, a read of this form will be entered into the
database as such: G - A C - C - T - AA T, etc. The current computer
programs which look for overlap examine 32 bases at a time, which is essentially
unique in the genome, so the first 128 bases of a double-primed sequencing run
create a unique "signature". This can be checked against the existing sequences in
the database as well as against the DrdI sequences generated from other clones. It
will line up either with a single read (i.e. when only one of the sites overlaps) or as an
identical double read (i.e. when both sites overlap). It is reasonably straightforward to
do a "subtraction" of one sequence from the double sequence to obtain the "hidden"
sequence.
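
The arithmetic and the handling of a double read can be illustrated with a short sketch. It is an assumption-laden illustration, not the program referred to in the text: the function names are hypothetical, a double read is represented as a list holding the one or two bases called at each position, and the probability is computed for three sites drawn from 6 overhang classes, which reproduces the 6/6 x 5/6 x 4/6 product.

```python
from fractions import Fraction

def prob_all_overhangs_distinct(sites, classes=6):
    """Probability that `sites` randomly drawn overhang classes all differ."""
    prob = Fraction(1)
    for used in range(sites):
        prob *= Fraction(classes - used, classes)
    return prob

def double_read_signature(read_a, read_b, gap="-"):
    """Keep positions where both superimposed reads agree; mark the rest."""
    return "".join(a if a == b else gap for a, b in zip(read_a, read_b))

def subtract_read(mixed_calls, known_read):
    """Recover the 'hidden' read from a double read once one read is known.

    mixed_calls[i] holds the one or two bases observed at position i.
    """
    hidden = []
    for bases, known in zip(mixed_calls, known_read):
        remaining = set(bases) - {known}
        hidden.append(remaining.pop() if remaining else known)
    return "".join(hidden)

p_distinct = prob_all_overhangs_distinct(3)
p_shared = 1 - p_distinct
print(float(p_distinct), float(p_shared))

signature = double_read_signature("GATC", "GTTA")
hidden = subtract_read(["G", "AT", "T", "CA"], "GATC")
print(signature, hidden)
```

In practice the known read would come from an overlapping clone in which the same site appears as a singlet.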

Evaluation of the BAC clones reveals a few instances where the same
overhang would appear in two DrdI sites from neighboring random 30-40 kb clones.
This requires that additional neighboring clones are found in a larger contig. If a
region remains intractable to analysis, because there are too many DrdI sites with the
identical overhangs, alternative enzymes BglI and DraIII may be used. A second
solution to sequencing reads which are difficult to interpret is to use four separate
sequencing reactions with primers containing an additional base on the 3' end, as
depicted at the bottom of Figure 1.

One advantage of generating DrdI islands is the format of the data.
The sequence information always starts at the same position. Thus, the computer
programs can be vastly simpler than previous lineup algorithms. A computer program
sets up bins to score identity. For example:

SEQ. ID. No. 1.
GATTCGATCGTAGCGTGTAGCAAGTAGCTAATTCGATCCA
                   |
GATTCGATCGTAGCGTGTAACAAGTAGCTAATTCGATCCA
SEQ. ID. No. 2.

i.e. 39/40 match, score as an overlap (with an SNP at position 20).
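
Because every read starts at the same position, scoring identity reduces to a position-by-position comparison with no alignment step. A minimal sketch of such a comparison, using the two example sequences above, might look as follows; the function names and the `min_identity` threshold are hypothetical choices made for illustration and are not specified in the text.

```python
def compare_reads(read_a, read_b):
    """Compare two reads that start at the same position.

    Returns the number of matching bases and the 1-based positions
    of any mismatches (candidate SNPs).
    """
    mismatches = [
        pos
        for pos, (a, b) in enumerate(zip(read_a, read_b), start=1)
        if a != b
    ]
    compared = min(len(read_a), len(read_b))
    return compared - len(mismatches), mismatches

def is_overlap(read_a, read_b, min_identity=0.95):
    """Score as an overlap when identity meets an assumed threshold."""
    matches, _ = compare_reads(read_a, read_b)
    compared = min(len(read_a), len(read_b))
    return compared > 0 and matches / compared >= min_identity

seq_1 = "GATTCGATCGTAGCGTGTAGCAAGTAGCTAATTCGATCCA"
seq_2 = "GATTCGATCGTAGCGTGTAACAAGTAGCTAATTCGATCCA"

matches, snp_positions = compare_reads(seq_1, seq_2)
print(matches, snp_positions, is_overlap(seq_1, seq_2))
```

A fuller implementation would first place each read into the bin for its overhang, so that comparisons are made only within a bin.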

Further simplifying the computer analysis, sequence information in the
DrdI analysis is generated in 12 separate sets, corresponding to each overhang, and
these sets are virtually exclusive. The probability of having a polymorphism right at
the 2 base 3' overhang is very small (about 2 in 1,000), and, even if the polymorphism
does occur, it will make two sequences jump to new bins, making it very easy to
double-check existence of such polymorphisms.

The above scheme has a built in redundancy, because each forward
sequence on a DrdI site is matched to a reverse sequence. It may be more cost
effective to ligate primers which give only one sequence read off a DrdI site. The
above example just doubles the probability of obtaining a sequence which overlaps
with either known STS's or with the two 500 base-pair sequences from the end of the
clone.

III. Singlet and Doublet DrdI Island Approach

Extending the DrdI island approach to allow for alignment of BACs.

On average, a given BAC will contain 2-3 unique sequences (called
"singlets"), 2-4 sequences which are the consequence of two overlapping runs (called
"doublets") and 0-1 sequences which are the consequence of three or more
overlapping runs, which may be un-interpretable multisequences. In order to
construct BAC clone overlaps, it is necessary to have at least two readable (doublets
or singlets) sequencing runs for a given BAC.

The probabilities of obtaining two readable sequencing runs from a
BAC clone containing from 2 to 20 DrdI sites are as follows.

A given restriction site may appear multiple times in a given BAC
clone. Therefore, it is necessary to determine the frequency of unique and doubly
represented restriction sites in a BAC clone. Sites which appear only once in a BAC
clone will generate a clean sequence and will be called singlets in the calculations.
Sites which appear exactly twice should still reveal useful sequencing data once every
four bases on average and will be known as doublets in the calculations.

The DrdI enzyme generates a degenerate 2 base 3' overhang. After
eliminating palindromic sequences for the degenerate positions, there are 6 different
overhangs which can be ligated after digestion of a BAC with DrdI.
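The count of 6 can be checked by enumeration. The following is a minimal sketch, assuming that a non-palindromic overhang and its reverse complement (the two strands of the same cut) are grouped into one class; the function names are illustrative and do not come from the specification.

```python
from itertools import product

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def overhang_classes(length: int = 2):
    """Split all overhangs of the given length into palindromes and
    classes that pair each non-palindromic overhang with its reverse complement."""
    palindromes = []
    classes = set()
    for bases in product("ACGT", repeat=length):
        seq = "".join(bases)
        rc = reverse_complement(seq)
        if seq == rc:
            palindromes.append(seq)                   # e.g. AT, TA, GC, CG
        else:
            classes.add(tuple(sorted((seq, rc))))     # e.g. ("AA", "TT")
    return sorted(palindromes), sorted(classes)


if __name__ == "__main__":
    palindromes, classes = overhang_classes()
    print("palindromic overhangs:", palindromes)
    print("non-palindromic classes:", len(classes), classes)
```

Under this grouping, 16 dinucleotides less 4 palindromes leave 12 overhangs, which pair into 6 classes.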

The SapI and BglI enzymes generate degenerate 3 base 5' and 3'
overhangs, respectively. 16 possible tails can be picked to ensure specific ligation
and to simplify the complexity of the sequencing reactions.

Below is an analysis of the possible ways that these restriction enzyme
sites can be distributed in BAC clones containing between 1 and 36 restriction sites.
From the representative BAC clones, the (non-palindromic overhang) DrdI site
appears from 8-10 times, the BglI site appears from 12-17 times, and the SapI site
appears from 12 to 25 times in human DNA. Note that the BglI site is used on both
sides of the cut, so for the calculations below, one doubles the number of BglI sites in
the BAC when calculating "N".
The probability of each site is p = 1/n, where n = 6 for DrdI and n = 16 for SapI or
BglI.

For a given restriction sequence R, the probability of a given site not being R is q:

    q = 1 - p
      = 1 - 1/n

The probability of all N sites in a given BAC not being the sequence R is:

    P(absent) = q^{N}

The probability of R appearing once and only once in N sites in a given BAC is:

    P(singlet) = p x q^{N-1} x N

The probability of R appearing twice and only twice in N sites in a given BAC is:

    P(doublet) = p^{2} x q^{N-2} x Comb(N,2)
               = p^{2} x q^{N-2} x (N)(N-1)/2

where Comb(N,k) is the number of ways that k items can be picked from a set of N
available items.

The probability that at least one of the 6 possible DrdI sites is a singlet is:

    P(at least one singlet) = 1 - (1 - P(singlet))^{6}

The probability that at least one of the 16 possible SapI or BglI sites is a singlet is:

    P(at least one singlet) = 1 - (1 - P(singlet))^{16}

The probability that at least one of the 6 possible DrdI sites is either a singlet or a
doublet is:

    Psd = P(singlet) + P(doublet)
    P(at least one singlet or doublet) = 1 - (1 - Psd)^{6}

The probability that at least one of the 16 possible SapI or BglI sites is either a singlet
or a doublet is:

    Psd = P(singlet) + P(doublet)
    P(at least one singlet or doublet) = 1 - (1 - Psd)^{16}

The probability of one and only one singlet or doublet for DrdI is:

    P(exactly one singlet or doublet) = 6 x Psd x (1 - Psd)^{5}
    P(exactly one singlet) = 6 x P(singlet) x (1 - P(singlet))^{5}

The probability of one and only one singlet or doublet for SapI or BglI is:

    P(exactly one singlet or doublet) = 16 x Psd x (1 - Psd)^{15}
    P(exactly one singlet) = 16 x P(singlet) x (1 - P(singlet))^{15}
For the BAC clones to be informative for constructing overlapping contigs, one needs
at least two readable sequences per clone. Calculations are provided for at least two
singlets or doublets, or the more stringent requirement of at least two singlets.

The probability of at least two singlets or doublets for DrdI is:

    P(at least two singlets or doublets)
        = P(at least one singlet or doublet) - P(exactly one singlet or doublet)
        = 1 - (1 - Psd)^{6} - 6 x Psd x (1 - Psd)^{5}

The probability of at least two singlets for DrdI is:

    P(at least two singlets)
        = P(at least one singlet) - P(exactly one singlet)
        = 1 - (1 - P(singlet))^{6} - 6 x P(singlet) x (1 - P(singlet))^{5}

The probability of at least two singlets or doublets for SapI or BglI is:

    P(at least two singlets or doublets)
        = P(at least one singlet or doublet) - P(exactly one singlet or doublet)
        = 1 - (1 - Psd)^{16} - 16 x Psd x (1 - Psd)^{15}

The probability of at least two singlets for SapI or BglI is:

    P(at least two singlets)
        = P(at least one singlet) - P(exactly one singlet)
        = 1 - (1 - P(singlet))^{16} - 16 x P(singlet) x (1 - P(singlet))^{15}

(Note: For small values, the charts below are not completely accurate.)
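The equations above can be evaluated directly. The following is a minimal sketch, not the procedure actually used to prepare the charts; the function name `site_probabilities` and the dictionary keys are illustrative assumptions. It treats the n overhang classes as independent, exactly as the equations do, which is one reason the small-N values are only approximate.

```python
from math import comb


def site_probabilities(N: int, n: int) -> dict:
    """Evaluate the equations above for N sites and n overhang classes
    (n = 6 for DrdI, n = 16 for SapI or BglI)."""
    p = 1.0 / n                 # probability that a site carries a given overhang R
    q = 1.0 - p                 # probability that it does not
    absent = q ** N
    singlet = N * p * q ** (N - 1)
    # comb(N, 2) is 0 when N < 2, so the doublet term vanishes for N = 1
    doublet = comb(N, 2) * p ** 2 * q ** (N - 2)
    psd = singlet + doublet

    def at_least_two(x: float) -> float:
        # P(at least one) - P(exactly one), taken over the n overhang classes
        return 1.0 - (1.0 - x) ** n - n * x * (1.0 - x) ** (n - 1)

    return {
        "absent": absent,
        "singlet": singlet,
        "doublet": doublet,
        "sd": psd,
        "at_least_two_sd": at_least_two(psd),
        "at_least_two_singlets": at_least_two(singlet),
    }


if __name__ == "__main__":
    for N in (8, 12):
        row = site_probabilities(N, n=6)
        print(N, " ".join(f"{value:.5f}" for value in row.values()))
```

Calling `site_probabilities(N, 16)` gives the corresponding SapI or BglI values, with N doubled for BglI as noted above.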
Using these equations, for DrdI the probabilities are:

 N  P(absent)  P(singlet)  P(doublet)  P(sd)    P(at least two         P(at least
                                                singlets or doublets)  two singlets)
 1  0.83333    0.16667     0.00000     0.16667  0.26322                0.26322
 2  0.69444    0.27778     0.02778     0.30556  0.59175                0.53059
 3  0.57870    0.34722     0.06944     0.41667  0.79174                0.67559
 4  0.48225    0.38580     0.11574     0.50154  0.89207                0.74399
 5  0.40188    0.40188     0.16075     0.56263  0.93897                0.76963
 6  0.33490    0.40188     0.20094     0.60282  0.96032                0.76963
 7  0.27908    0.39071     0.23443     0.62514  0.96946                0.75200
 8  0.23257    0.37211     0.26048     0.63259  0.97213                0.72083
 9  0.19381    0.34885     0.27908     0.62793  0.97048                0.67876
10  0.16151    0.32301     0.29071     0.61372  0.96501                0.62813
11  0.13459    0.29609     0.29609     0.59219  0.95532                0.57134
12  0.11216    0.26918     0.29609     0.56527  0.94059                0.51093
13  0.09346    0.24301     0.29161     0.53461  0.91981                0.44939
14  0.07789    0.21808     0.28351     0.50159  0.89211                0.38901
15  0.06491    0.19472     0.27260     0.46732  0.85690                0.33166
16  0.05409    0.17308     0.25962     0.43270  0.81412                0.27875
17  0.04507    0.15325     0.24520     0.39845  0.76430                0.23117
18  0.03756    0.13522     0.22987     0.36509  0.70850                0.18936
19  0.03130    0.11894     0.21410     0.33304  0.64826                0.15335
20  0.02608    0.10434     0.19824     0.30258  0.58537                0.12290
21  0.02174    0.09129     0.18259     0.27388  0.52173                0.09756
22  0.01811    0.07970     0.16737     0.24707  0.45911                0.07677
23  0.01509    0.06944     0.15276     0.22220  0.39906                0.05994
24  0.01258    0.06038     0.13887     0.19925  0.34280                0.04646
25  0.01048    0.05241     0.12579     0.17820  0.29121                0.03578
26  0.00874    0.04542     0.11356     0.15899  0.24480                0.02739
27  0.00728    0.03931     0.10221     0.14152  0.20376                0.02085
28  0.00607    0.03397     0.09172     0.12569  0.16805                0.01580
29  0.00506    0.02932     0.08210     0.11142  0.13742                0.01192
30  0.00421    0.02528     0.07330     0.09858  0.11148                0.00896
31  0.00351    0.02177     0.06530     0.08706  0.08977                0.00670
32  0.00293    0.01872     0.05804     0.07677  0.07180                0.00500
33  0.00244    0.01609     0.05149     0.06758  0.05706                0.00372
34  0.00203    0.01381     0.04559     0.05940  0.04509                0.00276
35  0.00169    0.01185     0.04029     0.05214  0.03544                0.00204
36  0.00141    0.01016     0.03555     0.04571  0.02771                0.00151
Using these equations, for SapI or BglI the probabilities are:

 N  P(absent)  P(singlet)  P(doublet)  P(sd)        P(at least two         P(at least
                                                    singlets or doublets)  two singlets)
 1  0.93750    0.06250     0.00000     0.06250      0.26411                [illegible]
 2  0.87891    0.11719     0.00391     0.12109      [illegible]            [illegible]
 3  0.82397    0.16479     0.01099     0.17578      [illegible]            [illegible]
 4  0.77248    0.20599     0.02060     0.22659      [illegible]            [illegible]
 5  0.72420    0.24140     0.03219     0.27359      0.95777                [illegible]
 6  0.67893    0.27157     0.04526     0.31684      0.98104                [illegible]
 7  0.63650    0.29703     0.05941     0.35644      [illegible]            0.97240
 8  0.59672    0.31825     0.07426     0.39251      0.99610                [illegible]
 9  0.55942    0.33565     0.08951     0.42516      [illegible]            [illegible]
10  0.52446    0.34964     0.10489     0.45453      [illegible]            [illegible]
11  0.49168    0.36057     0.12019     0.48076      [illegible]            0.99217
12  0.46095    0.36876     0.13521     0.50397      [illegible]            [illegible]
13  0.43214    0.37452     0.14981     0.52433      [illegible]            [illegible]
14  0.40513    0.37812     0.16385     0.54198      [illegible]            [illegible]
15  0.37981    0.37981     0.17725     0.55706      [illegible]            [illegible]
16  0.35607    0.37981     0.18991     [illegible]  [illegible]            [illegible]
Graphs showing the probabilities of two or more singlets or doublets of DrdI, SapI, or
BglI sites in BACs containing from 2 to 36 sites are shown in Figure 17A.

For the average of 8-12 non-palindromic DrdI sites per BAC clone, the
probability is from 94%-97% of containing at least two readable (singlet or doublet)
sequences. For the same clones, from 51%-72% will contain at least two singlet
sequences, making alignment even easier for those clones.

Thus, the overwhelming majority of BAC clones will contain at least
two readable (doublet or singlet) sequencing runs. Contigs may be constructed off
DrdI doublet sequencing runs since two doublet runs may be used to determine BAC
overlap, even if individual singlet sequences are unknown. Further, since the BAC
library will represent a 5-fold coverage of the genome, sequences which were buried
within three overlapping runs in one BAC clone will be represented as either singlets
or doublets in neighboring BAC clones. Surprisingly, the doublet data will even
allow for mapping virtually all DrdI islands onto the BAC clones.

How to collect the data:

In the past, "Gemini proteins" (i.e. proteins with duplicated domains)
were constructed. When using a sequencing primer which hybridizes to the
duplicated region, one obtains a sequencing run with a single read which turns into a
double read as the sequencing reaction extends past the duplicated region. Bands
were clearly visible for both sequences and the precise sequence could be determined
by subtracting the "known" sequence from the doublet sequence. New automated
DNA sequencing machines give excellent peak to peak resolution and would be able
to read doublet and even triplet sequences for hundreds of bases.

How to interpret the results:

A computer simulation was performed on 4 known sequenced BAC
clones from chromosome 7, and each clone generated at least 5 readable sequences. A
computer simulation of DrdI site sequences was performed on the first 5 such sites in
BAC RG253B13. The first 80 bp of sequence from each of these positions was
compared for either "concordant" or "discordant" alignment tests for a doublet
sequence.

To understand the power of aligning DrdI sites, it is important to
realize there are only about 200,000 to 300,000 DrdI sites in the human genome.
Further, since these are being sequenced in 6 different sets, there are about 35,000 to
50,000 DrdI sites in a given set. Thus, to distinguish a given sequence from others, it
must be unique at only one in 50,000 (not one in 3 billion) sites.

A key advantage for generating DrdI islands is the format of the data.
The sequence information always starts at the same position. The GTC half of the
DrdI site is retained in the sequencing read, thus assuring that the sequences are
always aligned correctly (see e.g. Figure 18 where sequences 1, 2, 3, 4, and 5 (i.e.
SEQ. ID. Nos. 3, 4, 5, 6, and 7, respectively) are aligned at the GTC motif). All the
sequences have the same orientation. There is no need to compare multiple
alignments or try the reverse sequence for alignment. Thus, computer programs can
be vastly simpler than previous lineup algorithms.

When comparing two singlet sequences, the uniqueness is determined
for any stretch of 8 bases (i.e. 4^{8} = 65,536). When comparing a doublet sequence with
a singlet sequence, the uniqueness may be determined either (1) by scoring identity at
8 bases in the doublet sequence with the singlet sequence (represented by vertical bars
(i.e. |) in Figure 18), or (2) by scoring 16 bases (i.e. 2^{16} = 65,536) where the singlet
sequence is consistent with either of the bases in the doublet at that position
(represented by a comma (i.e. ,) in Figure 18).
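The arithmetic behind these thresholds can be sketched as follows, under the simplifying assumption (not stated in the text) that bases are random and equiprobable. An identical position distinguishes among 4 bases, while a merely consistent doublet position distinguishes among 2 alternatives; the helper name is illustrative.

```python
def distinct_patterns(choices_per_position: int, positions: int) -> int:
    """Number of distinguishable patterns over the given number of positions."""
    return choices_per_position ** positions


identity_patterns = distinct_patterns(4, 8)       # 8 positions scored for identity
consistency_patterns = distinct_patterns(2, 16)   # 16 positions scored for consistency

if __name__ == "__main__":
    print(identity_patterns, consistency_patterns)
    # Compare against the approximate number of DrdI sites in one of the 6 sets.
    for sites_in_set in (35_000, 50_000):
        ratio = sites_in_set / identity_patterns
        print(f"{sites_in_set} sites per set: {ratio:.2f} sites per pattern")
```

Both criteria yield the same number of patterns, which exceeds the 35,000 to 50,000 sites in a given set.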

For example, in Figure 18, when analyzing the doublet to singlet
concordant sequences, the vertical line (i.e. |) indicates identity where the
corresponding base for the doublet and for the singlet are all the same. On the other
hand, the comma (i.e. ,) indicates consistency in that one of the bases in the doublet is
the same as the corresponding base in the singlet. In this example, there is
concordance (i.e. the sequences must match), because the number of bases, aside
from the GTC motif, which are identical (i.e. 12) is greater than 8 and the number which are
consistent (i.e. 63) exceeds 16. On the other hand, with regard to the doublet to
singlet discordant sequences, there are no vertical lines (i.e. |) or commas (i.e. ,) and,
as indicated by the Xs, there are numerous bases where neither base from the doublet
can match the corresponding base in the singlet. As a result, the doublet and the
singlet cannot be from the same clone (i.e. they are discordant).

When comparing a doublet read to another doublet read, the sequences
will contain a shared concordant read if there are at least 16 bases where either
doublet sequence has an identical base which is consistent with one or the other of the
two bases represented in the other doublet sequence. For example, in the concordance
comparison of a doublet in a first clone to a doublet in a second clone of Figure 18,
the vertical line (i.e. |) indicates identity where both bases of one doublet are the
same as one corresponding base in the other doublet. On the other hand, the comma
(i.e. ,) indicates consistency in that there are 2 different corresponding bases in one
doublet which are the same as the corresponding bases in the other doublet. For
example, in Figure 18, there is concordance, because, aside from the GTC motif, the
number of bases with identity (i.e. 26) (as indicated by |) added to the number of
bases with consistency (i.e. 17) (as indicated by a comma) (i.e. 26 + 17 = 43) exceeds
16. Turning to doublet to doublet analysis for discordance in Figure 18, there are no
vertical lines or commas, but, at several bases, there are Xs, indicating that neither
base from one doublet matches a corresponding base from the other doublet. This is,
perhaps, the most striking example of the power of this approach in that it easily
shows if two multiple bases do not overlap. In a random comparison of a doublet and
a singlet sequence, there are only 3 positions which are identical (|), and 38 which are
discordant (X). When comparing different doublets with one another, there are 12
discordant sites where one doublet has a single base (X), and 5 discordant sites where
all four bases were present (two from one doublet, two from the other doublet; x). For
simplicity, positions where more than two bases are read will not be considered, even
though those positions are still informative.
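The concordance test can be illustrated with a simplified sketch. It does not reproduce the separate "|" and "," scoring of Figure 18; instead it assumes that each read is a list of base sets aligned at the GTC motif, counts a position as compatible when the two sets share a base and as discordant when they do not, and tolerates a small, hypothetical number of discordant positions (`max_discordant`) to allow for an SNP. All names and toy sequences are illustrative.

```python
def merge(seq1: str, seq2: str):
    """Build a doublet read: the set of bases observed at each position."""
    return [frozenset((a, b)) for a, b in zip(seq1, seq2)]


def compare_reads(read_a, read_b):
    """Count positions whose base sets overlap (compatible) or not (discordant).
    A read may be a plain string (singlet) or a list of base sets (multiplet)."""
    compatible = discordant = 0
    for a, b in zip(read_a, read_b):
        if set(a) & set(b):
            compatible += 1
        else:
            discordant += 1
    return compatible, discordant


def shares_read(read_a, read_b, min_compatible=16, max_discordant=1):
    """Decide whether two reads plausibly contain a common underlying sequence."""
    compatible, discordant = compare_reads(read_a, read_b)
    return compatible >= min_compatible and discordant <= max_discordant


if __name__ == "__main__":
    # Toy 20-base sequences: cyclic shifts of ACGT differ at every position.
    base = "ACGT" * 6
    s1, s2, s3, s4 = (base[i:i + 20] for i in range(4))
    print(compare_reads(merge(s1, s2), s1))              # doublet vs. its own singlet
    print(shares_read(merge(s1, s2), merge(s2, s3)))     # doublets sharing s2
    print(shares_read(merge(s1, s2), merge(s3, s4)))     # unrelated doublets
```

Because every read begins at the same position and in the same orientation, a single position-by-position pass suffices; no alignment search is needed.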

Figure 18 also shows doublet to triplet analyses for concordant and
discordant sequences. These procedures are carried out in substantially the same
fashion as the doublet to doublet analysis described above. However, the vertical line
(i.e. |) now indicates identity where both bases of one doublet are the same as one
corresponding base in the triplet or all bases of the triplet are the same as one
corresponding base in the doublet. On the other hand, the comma (i.e. ,) indicates
consistency in that there are 2 different corresponding bases in the doublet which are
the same as two of the corresponding bases in the triplet.

Again, the sequences will contain a shared concordant read if there are
at least 16 cases where either doublet or triplet sequence has an identical base which
is consistent with one or the other of the two bases represented in the other sequence.
For example, in the alignment of concordant sequences for the doublet to triplet in
Figure 18, there are 12 such positions in the first 80 bp. However, there are also 14
positions where the two reads have the same two bases at that position, bringing the
total concordant positions to 26, well in excess of the 16 positions required.
Comparing a doublet with a triplet yielded 11 discordant sites. The prediction is that
one SNP will be observed every 1,000 bases, so single base discordance representing
SNPs will be rare but also easily distinguished from the average of 10 to 40
discordant sites when comparing doublets with triplet, doublet, and singlet sequences.

Thus, in as few as 80 bases of sequence, one can easily discern if there
is a common or discordant DrdI sequence within the two reads which are being
compared, when the two reads contain a singlet, doublet, or even a triplet.

Using smaller representational fragments as an alternative approach to alignment of
BACs

The previous section described an approach to interpret singlet,
doublet, and triplet sequences generated from representations of individual BAC
clones using as few as 80 bases of sequence information. The assumption was made
that when more than one fragment is generated from a given representation (i.e. DrdI
site AA overhang), then those fragments would be present in about equal amounts.
Further, the above approach requires specialized software to interpret a sequencing
read where more than one base is called at a given position. As an alternative to
deconvoluting doublet and triplet sequencing runs, other enzymes may be used to
create short representational fragments. Such fragments may be differentially
enriched via ultrafiltration to provide dominant signal, or, alternatively, their differing
length provides unique sequence signatures on a full length sequencing run, such that
unique sequences for more than one fragment can be interpreted on a single
sequencing lane.

For human DNA within BACs, MseI can be substituted for MspI/TaqI,
resulting in generation of much shorter representational fragments (Figure 19 and
Figure 20). Bubble linkers for MspI/TaqI on one hand and for MseI on the other hand
are disclosed in Table 4.
Table 4. New MspI/TaqI and MseI bubble linkers.

New MspI/TaqI linkers

MTCG225     5' GAC ACG TCA CGT CTC GAG TCC TA 3'          (SEQ. ID. No. 8)
MTCG0326R   3' Bk-TGC AGT GCA ACA CTC AGG ATGC 5'         (SEQ. ID. No. 9)

MTCG225     5' GAC ACG TCA CGT CTC GAG TCC TA 3'          (SEQ. ID. No. 10)
MTCGp326R   5' pCGT AGG ACT CAC AAC GTG ACG T - Bk        (SEQ. ID. No. 11)

MTCG0326R   5' CGT AGG ACT CAC AAC GTG ACG T - Bk         (SEQ. ID. No. 12)

MTCG227     5' GAC ACG TCA CGT CTC GAG TCC TsAsC 3'       (SEQ. ID. No. 13)

MTCG228     5' GAC ACG TCA CGT CTC GAG TCC TAC 3'         (SEQ. ID. No. 14)

New MseI linkers (MseI site = TTAA)

MSTA275     5' GAC ACG TCA CGT CTC GAG TCC TC 3'          (SEQ. ID. No. 15)
MSTA0276R   3' Bk-TGC AGT GCA CTC AGG AGAT 5'             (SEQ. ID. No. 16)

MSTA275     5' GAC ACG TCA CGT CTC GAG TCC TC 3'          (SEQ. ID. No. 17)
MSTAp276R   5' pTAG AGG ACT CAC AAC GTG ACG T - Bk        (SEQ. ID. No. 18)

MSTA0276R   5' TAG AGG ACT CAC AAC GTG ACG T - Bk         (SEQ. ID. No. 19)

MSTA278     5' GAC ACG TCA CGT CTC GAG TCC TCT AA 3'      (SEQ. ID. No. 20)
MseI cleaves human genomic DNA approximately every 125 bp. In contrast, when
using MspI/TaqI as the second enzyme, the average size fragment is greater than
1,000 bp. Many of the larger fragments (i.e. greater than 2,000 bp) will not amplify
as well as smaller fragments in a representation, i.e. they will be lost to the sequencing
gel. Therefore, in a DrdI-MseI representation, the number of unique fragments lost
during PCR amplification may be greatly reduced. This can increase the number of
amplified fragments per BAC and can facilitate alignment of BACs.

DrdI representations of individual BACs can be used to link BACs
together to form contigs. For BACs that generate a doublet sequence, "singlet"
sequence information can still be obtained as long as the fragments are of different
lengths. For example, an AG DrdI/MseI representation of BAC RG253B13 results in
two fragments of length 115 and 353 bases. Sequencing of these two fragments
simultaneously will result in two distinct regions of sequence. The first region
(approx. 1-141 bases) will consist of an overlap sequence in which sequence
information from both fragments will be observed. The last 25 bases of this sequence
will be the linker adapter sequence on the MseI adapter. Thus, one can easily
distinguish when the shorter fragment "ends" on the sequencing run. In all likelihood,
it will also be more abundant and, hence, provide a stronger signal for those bases
which were derived from that shorter fragment. If this stronger signal is not sufficient
to recognize the unique sequence, then ultrafiltration (i.e. use of Amicon filters YM30
and YM125 (made by Millipore, Danvers, MA)) may be used to enrich for "smaller"
vs. "larger" fragments. The second region (approx. 141-353 bases) will consist only
of sequence information from the longer fragment. Therefore, for any doublet in
which the fragments are of different length, a "singlet" sequence will be generated for
the non-overlapping region of the longer fragment. This non-overlapping region of
the doublet can be utilized as a "singlet" in order to overlap BACs. A minimum of 8
unique bases for a given distance from the DrdI site is sufficient to uniquely identify
the sequence in the human genome, because the DrdI site provides an additional 6 + 2
= 8 bases of unique sequence, bringing the total to 16 bases.
How to align the BAC clones to create a complete contig of the entire human genome

As mentioned earlier, there are only about 200,000 to 300,000 DrdI
sites in the human genome. Since these are being sequenced in 6 different sets, there
are about 35,000 to 50,000 DrdI sites in a given set. Alignment of the BAC clones is
a simple process of constructing contigs in each of the 6 sets.

Consider creating contigs in the sequencing set whose linker primer
ends in "GG". Suppose a given BAC clone, B1, contains a doublet sequence of #1 &
#2. By searching the database one finds a second BAC clone, B2, containing a
doublet sequence of #2 & #3. This implies that BAC clones B1 and B2 overlap, and
further that the order of the DrdI islands is #1, #2, and #3. (The approach for
determining individual sequence runs #1, #2, and #3 is explained below.) Consider
then additional BACs: B3 with islands #3, #4, and #5, B4 with #4 & #6, B5 with #6,
and B7 with #6 & #7. Then the BAC clone overlap is B1-B7 and the sequences are in
the order: #1, #2, #3, #5, #4, #6, #7. In other words, the DrdI islands not only line up
the BAC clone overlaps, they also provide the order they appear in the linear
sequence.
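The overlap step of this example can be sketched as follows. This is a minimal illustration, assuming each BAC is represented simply by the set of island numbers detected in one overhang set; it finds overlapping BAC pairs and groups them into contigs, but does not attempt to order the islands. The function names are illustrative.

```python
from itertools import combinations


def bac_overlaps(islands_by_bac: dict) -> list:
    """Return sorted pairs of BACs that share at least one DrdI island."""
    return sorted(
        (a, b)
        for a, b in combinations(sorted(islands_by_bac), 2)
        if islands_by_bac[a] & islands_by_bac[b]
    )


def contigs(islands_by_bac: dict) -> list:
    """Group BACs into contigs: connected components of the overlap graph."""
    neighbours = {bac: set() for bac in islands_by_bac}
    for a, b in bac_overlaps(islands_by_bac):
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen, groups = set(), []
    for start in sorted(islands_by_bac):
        if start in seen:
            continue
        stack, group = [start], []
        while stack:                      # depth-first walk over overlapping BACs
            bac = stack.pop()
            if bac in seen:
                continue
            seen.add(bac)
            group.append(bac)
            stack.extend(neighbours[bac] - seen)
        groups.append(sorted(group))
    return groups


if __name__ == "__main__":
    # Island content of each BAC, taken from the example in the text.
    example = {
        "B1": {1, 2}, "B2": {2, 3}, "B3": {3, 4, 5},
        "B4": {4, 6}, "B5": {6}, "B7": {6, 7},
    }
    print(bac_overlaps(example))
    print(contigs(example))
```

Running the same grouping independently in each of the 6 overhang sets and merging the results is one way to combine the sets.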

How frequent are the individual members of a set? With one
non-palindromic DrdI site every 10-15 kb, the average distance between two DrdI sites
with the same dinucleotide overhang sequence is 60 to 75 kb, or on average one to
two such sites per BAC clone. Computer simulation on four BAC clones
demonstrated 2 duplex sites separated by less than 25 kb, 5 duplex sites separated by
between 25 kb and 50 kb, 2 duplex sites separated by between 50 kb and 75 kb, and 2
duplex sites greater than 75 kb apart. Thus, a 5-fold coverage of a region of DNA
will create BAC clones with an average of two same overhang sites per BAC clone,
but many such sites will be represented as either singlet or doublet reads in
neighboring overlapping BAC clones.

On a rare occasion, a long stretch of human DNA will lack a DrdI site
with a given dinucleotide overhang (i.e. GG), such that even larger BAC clones of
175-200 kb would not include two such sites. However, the BAC clone contigs are
being pieced together using six sets of DrdI sequence information. This is akin to
using six different restriction enzymes to create a restriction map of pBR322. Thus, a
"gap" in the contig is easily filled using sequence information from one of the other 5
sets. The average BAC of 8-12 DrdI sites contains sequence information ranging
from 4 to all 6 of the different contig sets. Thus, by combining the contig building
among the 6 different sets, the entire genome contig can be built.

Using the DrdI island database to obtain unique singlet sequences from overlapping
doublet and triplet BAC clones.

When BAC overlaps are found, the data may be immediately used to
deduce unique singlet sequences at essentially all of the DrdI sites. As the simplest
case, when comparing a doublet with a singlet sequence, subtraction of the singlet
sequence will reveal the other singlet in the doublet sequence. In most cases, a
doublet will be represented again as a singlet in a neighboring BAC. In some cases,
two or three doublets will be connected in a series. Even one singlet at the end of a
string of doublets may be used to deduce the unique sequences of the individual DrdI
islands.

Remarkably, just three overlapping doublets may be used to determine
all four individual singlet sequences. For example, as shown in Figure 17, 4 unique
singlet DrdI sequences from 2 overlapping doublet BAC clone sequences are obtained
by aligning them as shown and comparing the corresponding bases. The common
sequence between two doublets will either be identical, i.e. AA compared with AA
(S), the same in one doublet allowing assignment, i.e. AA compared with AG
indicates the common base is "A" (s), different among the doublets, also allowing
assignment, i.e. AG compared with AC indicates the common base is "A" (d), or
indeterminate, i.e. AG compared with AG does not reveal the base (i). On average, 3
out of every 4 positions will allow assignment of the common sequence base. Based
upon this analysis, the sequence common in each doublet can be determined with a
nucleotide at each location receiving an S, s, or d designation. In this manner, a
sequence is identified with locations having the i designation being assigned
alternative bases. Figure 21 shows how the sequences for #2 and #3 are determined in
this fashion. This information can then be used to compare the consensus sequences
of #2 and #3 from which one can determine the overlap. With only 2 indeterminate
bases, the sequences for #2 and #3 can be found. Sequence information for #1 and #4
can then be obtained.
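The two operations described here, finding the base common to two doublets and subtracting a known singlet from a doublet, can be sketched as below. This is an illustrative simplification: each doublet position is written as a two-letter string, indeterminate positions are reported in brackets, "X" marks a position with no shared base, and the S, s, and d designations are not tracked separately. The function names are hypothetical.

```python
def consensus(doublet_a, doublet_b) -> str:
    """Common read of two doublets. Each doublet is a list of 2-base strings,
    e.g. "AG" means both A and G were called at that position."""
    out = []
    for a, b in zip(doublet_a, doublet_b):
        shared = set(a) & set(b)
        if len(shared) == 1:
            out.append(next(iter(shared)))                   # base can be assigned
        elif shared:
            out.append("[" + "".join(sorted(shared)) + "]")  # indeterminate position
        else:
            out.append("X")                                  # no shared base
    return "".join(out)


def subtract(doublet, singlet: str) -> str:
    """Remove a known singlet from a doublet, leaving the other read."""
    other = []
    for pair, base in zip(doublet, singlet):
        bases = set(pair)
        if base not in bases:
            raise ValueError(f"singlet base {base!r} not present in {pair!r}")
        rest = bases - {base}
        # If both reads have the same base here, the other read is that base too.
        other.append(next(iter(rest)) if rest else base)
    return "".join(other)


if __name__ == "__main__":
    # The four cases from the text: AA/AA, AG/AA, AG/AC, AG/AG.
    print(consensus(["AA", "AG", "AG", "AG"], ["AA", "AA", "AC", "AG"]))
    print(subtract(["AA", "AG", "CT"], "AGC"))
```

Once a consensus read has been assigned, `subtract` can be applied to each doublet in turn to recover the remaining singlets.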

The same analysis may be applied to alignment of one of the doublets
with another neighboring doublet (or even triplet). See Figure 22. Although the
sequence which is common between these sets is different from the original doublet
comparison, the two consensus sequences may now be compared with the original
doublet sequencing run. The probability that the indeterminate sequence in one
sequence is at the same position as the other sequence is 1/4 x 1/4 = 1/16 for the
doublet-doublet-doublet comparison and 1/4 x 7/16 = 7/64 for the
doublet-doublet-triplet comparison. The remaining portions of the sequence, i.e. 15/16
and 57/64 of the sequence, are determined, and this allows one to deduce the
remaining singlet sequences.

In the simulation of a doublet-doublet-doublet comparison, 78 out of
80 bases could be uniquely assigned for all four singlet sequences. In the
doublet-doublet-triplet comparison, 73 out of 80 bases could be uniquely assigned for
all three singlet sequences. This is far in excess of the 8 bases required to uniquely
identify a given singlet sequence.

Sequencing of DrdI island PCR fragments from BACs, or directly off BACs.

As discussed supra, a method was provided for sequencing DNA
directly from the plasmid or cosmid clone by PCR amplification of the insert. While
PCR amplification has not been demonstrated for DNA of BAC clone length, the
DrdI island may be PCR amplified by using a second frequent cutter enzyme to create
small fragments for amplification. The second enzyme would contain a two base 5'
overhang such that ligation/cutting could proceed in a single reaction tube. The
ligation primers/PCR primers can be designed such that only DrdI-second enzyme
fragments amplify. PCR primers may be removed by using ribose containing primers
and destroying them with either base (i.e. 0.1 N NaOH) or using dU and UNG. An
alternative approach to sequence DNA directly from PCR-amplified DNA uses
ultrafiltration in a 96 well format to simply remove primers and dNTPs physically,
and is commercially available from Millipore.
Examples of frequent enzymes with TA overhangs (and frequency in
the human genome) are: BfaI (CTAG, 1 every 350 bp), Csp6I (GTAC, 1 every 500
bp) and MseI (TTAA, 1 every 133 bp). For fragments with larger average sizes, four
base recognition enzymes with CG overhangs may be used: MspI (CCGG, 1 every
2.1 kb), HinP1I (GCGC, 1 every 2.5 kb), and TaqI (TCGA, 1 every 2.6 kb).

There is a chance that the second site enzyme cleaves either too close
to a DrdI site to generate sufficient sequence or, alternatively, too distantly to amplify
efficiently. This site will simply not be scored in the database, just as DrdI sites with
palindromic overhangs (i.e. AT) are not scored. If it is critical to obtain that precise
sequence information, the problem may be addressed by using a different second
enzyme. One advantage of using the "CG" site enzymes is that average fragment
sizes will be larger and, consequently, will be amenable to generating neighboring
sequence information from the second site if needed. This may be helpful for
increasing the density of internal sequence information linked to a BAC clone or
plasmid/cosmid clone.

Plasmids containing colE1 replication origins (i.e. pBR322, pUC
derivatives) are present at high copy number which may be increased to 100's by
growing clones for two days or to 1,000's by amplification with chloramphenicol.
This should provide sufficient copy number such that it is not necessary to separate
plasmid/cosmid DNA from host bacterial chromosomal DNA. On the other hand,
BAC clone vectors, which are based on the F factor origin of replication, may be
present at copy numbers equal to or only slightly higher than the bacterial chromosome.
Thus, it will probably be necessary to partially purify BAC clone DNA from bacterial
chromosome DNA. The relative advantages and disadvantages of PCR amplification
followed by direct sequencing vs. rapid purification of plasmid, cosmid, or BAC
clone followed by sequencing need to be determined experimentally.

Alternative enzyme?; Sapl and Bgfl, 

There may be regions of the genome which contain fewer than two
readable DrdI sequences. One solution to this problem is to use a second enzyme
with a comparable frequency in the human genome. By slightly modifying the
procedure, 16 linker/primer sets may be used on split palindrome enzymes which
generate a 3 base 3' overhang. Since the overhang is an odd number of bases, it is not
necessary to exclude the palindromic two base sequences AT, TA, GC, and CG. To
reduce the number of ligations from 64 (all the different possible 3 base overhangs) to
16, the linkers and primers are degenerate at the third position, i.e. end with NTC or
NGC. Since there are 3 levels of specificity in the ligation and sequencing step, the
third base degeneracy will not interfere with the fidelity of the reaction.
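
The reduction from 64 ligations to 16 can be illustrated with a short sketch. It
assumes, for illustration only, that the degenerate position is the first base of the
3 base overhang as written, so that each set is defined by its last two bases (as in
NTC or NGC). The function name is hypothetical.

```python
from itertools import product

BASES = "ACGT"

def linker_sets():
    """Group all 3 base overhangs by their last two bases.

    Assumption: the first base (N) is the degenerate position, so one
    linker/primer set covers the four overhangs sharing the same final two bases.
    """
    sets = {}
    for overhang in ("".join(p) for p in product(BASES, repeat=3)):
        key = "N" + overhang[1:]
        sets.setdefault(key, []).append(overhang)
    return sets

sets = linker_sets()
print(len(sets))       # number of linker/primer sets
print(sets["NTC"])     # overhangs covered by the NTC set
```

Each of the resulting sets covers four of the 64 possible overhangs.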

Of the 4 commercially available split palindrome enzymes which
generate a 3 base 3' overhang, BglI (GCCNNNN^NGGC (SEQ. ID. No. 21)) and
DraIII (CACNNN^GTG) are present at low enough frequencies to be compatible with
DrdI. There are 60 BglI sites in about 550 kb of the four sequenced BAC clones, or an
average of 1 BglI site per 9 kb. Since the linkers can ligate to both sides of a BglI site,
there are twice as many ends (i.e. sequences) generated as with the DrdI sites. See
Figure 16. Using BglI, there are two levels of specificity for creating a unique
representation: (i) ligation of the top strand, and (ii) extension of the sequencing
primer with polymerase. Unlike DrdI, the use of a last base degeneracy in the BglI
linker does not allow one to determine sequence information from only one side. If
there are too many BglI sites in a given BAC, or there is a need to obtain singlet
sequence information, one may obtain additional specificity by designing primers
which reach in one additional base on the 3' side of the ligation junction (i.e.
GCCNNNN^NGGC (SEQ. ID. No. 22)). As with DrdI, the conserved GGC on the 3'
side of the cut site allows all sequences in a set to be easily compared in the correct
alignment. As with the DrdI site, use of a second enzyme or enzyme pair (MspI
and/or TaqI) and corresponding linkers allows for specific amplification of the BglI
site fragments (See Figure 16A).

One type IIs enzyme, SapI (GCTCTTCN1/4), generates a 3 base 5'
overhang which allows for unidirectional ligation, i.e. simultaneous cutting and
ligation will only provide the sequence from one side. See Figure 17. There are 69
SapI sites in about 550 kb of the four sequenced BAC clones, or an average of 1 SapI
site per 8 kb. One advantage of SapI is that most vectors lack this site. Two
disadvantages of SapI are that the 5' 3 base overhang will be filled in if using the
enzyme after a PCR amplification, and the need to test a few (5-10) different starting
positions to align doublet or triplet sequences precisely with each other. If there is a
need to obtain a singlet sequence, one may obtain additional specificity by designing
primers which reach in one or two additional bases on the 3' side of the ligation
junction (i.e. GCTCTTCN^NNNNN (SEQ. ID. No. 23)). One big advantage of using
this enzyme is that the majority of SapI sequences yield singlet reads.

The probabilities of obtaining two readable sequencing runs from a
BAC clone containing from 2 to 36 BglI or SapI sites have been calculated. For the
average of 12-17 BglI sites per BAC clone (=24-34 ends), the probability is 99.9% for
containing at least two readable (singlet or doublet) sequences. For the same clones,
from 93%-98% will contain at least two singlet sequences, making alignment even
easier for those clones. For the average of 12-25 SapI sites per BAC clone, the
probability is 99.9% for containing at least two readable (singlet or doublet)
sequences. For the same clones, from 98.8%-99.3% will contain at least two singlet
sequences, making alignment even easier for those clones (see Figure 17A).
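
The text does not state how these probabilities were derived. As a rough
illustration only, the following Monte Carlo sketch assumes that each end falls at
random, with equal probability, into one of 16 linker sets, and that a set is readable
when it contains exactly one end (singlet) or two ends (doublet). Both the model and
the function name are assumptions, so the output need not reproduce the figures
quoted above.

```python
import random
from collections import Counter

def readable_probabilities(n_ends, n_sets=16, trials=20000, seed=1):
    """Estimate P(>=2 readable sets) and P(>=2 singlet sets) by simulation.

    Assumed model (not from the text): ends are assigned uniformly at random
    to linker sets; 1 end in a set = singlet, 2 ends = doublet.
    """
    rng = random.Random(seed)
    readable_hits = singlet_hits = 0
    for _ in range(trials):
        counts = Counter(rng.randrange(n_sets) for _ in range(n_ends))
        singlets = sum(1 for c in counts.values() if c == 1)
        doublets = sum(1 for c in counts.values() if c == 2)
        if singlets + doublets >= 2:
            readable_hits += 1
        if singlets >= 2:
            singlet_hits += 1
    return readable_hits / trials, singlet_hits / trials

# 12 BglI sites give 24 ends, the low end of the average range.
p_readable, p_singlet = readable_probabilities(n_ends=24)
print(f"P(at least two readable) ~ {p_readable:.3f}")
print(f"P(at least two singlets) ~ {p_singlet:.3f}")
```

Changing n_ends and n_sets shows how the number of sites per clone and the number
of linker sets trade off under this simplified model.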

Although there are a total of 16 different ligation primers which may
be used with the BglI or SapI sites (indeed, up to 64 may be used), it is not necessary
to use all of them. Given the frequency of BglI sites in the human genome, and the
fact that a single site provides two non-symmetric overhangs, 8 different ligation
primers would be sufficient. Should a BglI site be present in low abundance repetitive
DNA, that overhang would also not be used. Simulation on a dozen BAC clones
would provide a more complete assessment of which 8 primers should be chosen for a
BglI representation. With SapI, each site provides one non-symmetric overhang, so
the majority of SapI sites per BAC clone provide singlet or doublet reads. Thus,
anywhere from 6 to 10 different ligation primers may be chosen to provide a robust
set of SapI islands to assure overlap of all the BAC clones. The advantage of using
BglI or SapI with 6 to 10 different ligation primers is that additional primers may be
used as needed on only those BAC clones which represent the ends of contigs. The
underlying concept is that each unique linker creates a set of sequences which may be
linked through singlet and doublet reads, or BAC clone overlap, or both.



Presence of DrdI or other sites in BAC or plasmid vectors.

One important technical note is that the most common BAC vector,
pBeloBAC11 (Genbank Accession # U51113 for complete DNA sequence), and the
common plasmid vectors contain 4 and 2 DrdI sites, respectively.

Thus, one needs to fine tune the experimental approach to circumvent
restriction sites in the vector sequences. The three basic approaches are to (i) remove
the restriction sites from the vector before constructing the library, (ii) destroy the
vector restriction sites in clones from a given library, or (iii) suppress amplification of
vector fragments using sequence specific clamping primers.

Restriction sites can be removed from the BAC vector pBeloBAC11,
which contains 4 DrdI sites, 4 BglI sites, and 2 SapI sites. See Figure 21. The
procedure for removing DrdI sites in a single cloning step will be described, and it is
generally applicable to all the sites. One of the tricks of split palindrome enzymes
which generate a 3 base 3' overhang, such as BglI (GCCNNNN^NGGC (SEQ. ID.
No. 21)), DraIII (CACNNN^GTG), AlwNI (CAGNNN^CTG), and PflMI
(CCANNNN^NTGG (SEQ. ID. No. 24)), is that there is a high chance of creating
fragments where all the sticky ends are unique. In such a case, a plasmid may be
cleaved with the enzyme, one or more pieces replaced, and, then, in the presence of
T4 ligase, the plasmid reassembles correctly and can be recovered by transforming
into E. coli. The replacement fragments lack the DrdI site(s) such that silent
mutation(s) are introduced into any open reading frames. The replacement fragments
are generated by overlap PCR, and the ends of such PCR fragments are converted to
unique overhangs using the split palindrome enzyme (i.e. BglI). To illustrate with
pBeloBAC11, two overlap PCR primers are designed to eliminate the DrdI site at
1,704, and the fragment is generated using two primers just outside BglI sites at 634
and 2,533. This fragment is cleaved with BglI after PCR amplification. Likewise, six
overlap PCR primers are designed to eliminate the DrdI sites at 2,616, 3,511, and
4,807, and the whole fragment is generated using two primers just outside BglI sites at
2,533 and 6,982. This fragment is also cleaved with BglI after PCR amplification.
The fragments are mixed with BglI cut pBeloBAC11, and ligase is added in the
presence of DrdI. Thus, circular ligation products containing the newly PCR
amplified fragments lacking DrdI sites are selected for, and recovered after
transformation into E. coli. The pBeloBAC11 vector has been modified (in
collaboration with New England Biolabs) essentially as described above to create
vector pBeloBAC11 No DrdI, which, as its name implies, lacks DrdI sites. The same
principle may be used to remove the SapI sites and even the BglI sites or all 10 sites if
desired. In the latter case, the split palindrome enzyme PflMI (4 sites in
pBeloBAC11) would be used. The same procedure may be applied to plasmid vectors
such as pUC19, which contain only 2 each of DrdI and BglI sites and no SapI sites.
See Figure 24.

The vector restriction site or its sequence can be destroyed by treating
the vector-insert DNA with various restriction enzymes. The vector sites can be
eliminated so that the (DrdI) enzyme does not cut at that position or, alternatively,
generates such a small sequence (i.e. 10-20 bases) that overlap from vector sequence
only minimally interferes with interpretation of the data. This may appear as extra
work; however, when using simultaneous restriction/ligation conditions, it is simply a
matter of including (an) additional restriction endonuclease(s) in the same mixture.
The linker primers will not ligate onto the other restriction site overhangs as they are
not compatible.

Representational amplification from BACs may be modified to
suppress amplification of vector fragments using sequence specific clamping primers.
The pBeloBAC11 and pBACe3.6 vectors both contain DrdI sites complementary to
AA-, CA-, and GA- overhangs. Clamping oligonucleotides which bind specific DrdI
fragments (i.e. vector derived) and block annealing of PCR primers or PCR
amplification were designed as PNA or propynyl derivatives and are listed in
Tables 5 and 6.
Table 5. PNA designed for suppression of DrdI sites associated with the
pBeloBAC11 vector.

Primer        Sequence (NH2 -> CONH2)

CA-PNA27-3    NH2 GCC AGT COG AGC ATC AGG CONH2 (SEQ. ID. No. 25)
GA-PNA23-4    NH2 CCC CGT GGA TAA GTG GAT CONH2 (SEQ. ID. No. 26)
GA-PNA25-2    NH2 ACA CGG CTG CGG CGA GCG CONH2 (SEQ. ID. No. 27)
AA-PNA21      NH2 GCC GCC GCT GCT GCT GAC CONH2 (SEQ. ID. No. 28)
Table 6. Propynyl primers designed for suppression of DrdI sites associated
with the pBeloBAC11 vector.

Primer        Sequence (5' -> 3')

AA Cl PY3     5' GsCs(pC)sGsCs(pC)sGCT G(pC)T G(pC)T GA(pC) GG(pT) GTG
              A(pC)G TT-Bk 3' (SEQ. ID. No. 29)

GA Cl PY6     5' GsAs(pC)sTsGsTs(pC)AT T(pT)G AGG G(pT)G AT(pT) TGT
              (pC)AC A(pC)T GAA AGG G-Bk 3' (SEQ. ID. No. 30)

GA Cl PY10    5' GsAs(pT)sAsGsTs(pC)TG AGG G(pT)T AT(pC) TGT (pC)AC
              AGA T(pT)T GAG GG(pT) GG-Bk 3' (SEQ. ID. No. 31)

CA Cl PY14    5' CsAs(pT)sAsGsTs(pC)AT GAG (pC)AA (pC)AG TTT (pC)AA
              TGG (pC)CA GT(pC) GG-Bk 3' (SEQ. ID. No. 32)

The designations (pC) and (pT) represent propynyl-dC and propynyl-dT, respectively.

The PNA oligonucleotides were designed to maximize Tm values in an
18mer sequence, while attempting to also maximize pyrimidine content and avoiding
three purines in a row. The propynyl derivative oligonucleotides were designed to
overlap the DrdI site by two bases, and to contain a total of about 5 to 9 and,
preferably, 7 propynyl dC and propynyl dU groups to increase the Tm, as well as
about 4 to 8 and, preferably, 6 thiophosphate groups at the 5' side to avoid 5'-3'
exonuclease digestion by Taq polymerase during amplification. (Propynyl derivatives
are known to increase oligonucleotide Tm values by approximately 1.5-1.7°C per
modification, while thiophosphate modifications slightly reduce Tm values by about
0.5°C per modification). These propynyl derivative clamping oligonucleotides were
from approximately 25 to 40 bases in length. Alternative propynyl designs which do
not overlap the DrdI site would also be predicted to suppress vector amplification.
Alternative nucleotide modifications which both increase Tm values and prevent 5'-3'
exonuclease digestion by Taq polymerase, such as 2'-O-methyl derivatives, may also
be used. Tm values for both PNA and propynyl derivative clamps were generally
above 85°C and, preferably, above 90°C to achieve effective clamping. When the
propynyl derivative clamping oligonucleotides were synthesized without either the
propynyl or thiophosphate modifications, they were insufficient to effectively block
amplification of vector sequences. In general, reactions using 10 ng of digested/linker
ligated BAC DNA were subjected to 30-35 cycles (94°C, 15 sec; 65°C, 2 minutes) of
PCR amplification using 25 picomoles each of primers and 50 picomoles of the
corresponding clamp. These conditions were sufficient to allow for amplification of
insert DrdI representational fragments while inhibiting amplification of the vector
sequences. The principles of using PNA clamps to suppress amplification of
undesired fragments have been described in the literature (Cochet O., et al., "Selective
PCR Amplification of Functional Immunoglobulin Light Chain from Hybridoma
Containing the Aberrant MOPC 21-Derived V kappa by PNA-mediated PCR
Clamping," Biotechniques 26:818-822 (1999) and Kyger E., et al., "Detection of the
Hereditary Hemochromatosis Gene Mutation by Real-time Fluorescence Polymerase
Chain Reaction and Peptide Nucleic Acid Clamping," Anal Biochem 260:142-148
(1998), which are hereby incorporated by reference).
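
Using the approximate per-modification values quoted above, the net effect of the
modifications on Tm can be estimated. The sketch below simply applies that
arithmetic to the preferred design of 7 propynyl groups and 6 thiophosphate groups.
It assumes the effects are additive, and the function name is hypothetical.

```python
def net_tm_shift(n_propynyl, n_thiophosphate,
                 propynyl_gain=(1.5, 1.7), thio_loss=0.5):
    """Return the (low, high) estimated Tm change in degrees C.

    Assumes each modification contributes independently and additively.
    """
    loss = n_thiophosphate * thio_loss
    low = n_propynyl * propynyl_gain[0] - loss
    high = n_propynyl * propynyl_gain[1] - loss
    return low, high

low, high = net_tm_shift(n_propynyl=7, n_thiophosphate=6)
print(f"Estimated net Tm shift: +{low:.1f} to +{high:.1f} C")
```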

IV. Comparison of DrdI Island Approach With Other Endonucleases

Different approaches to generate representations of the genome.

DrdI is a unique restriction endonuclease. It has an infrequent 6
base recognition sequence and generates a degenerate 2 base 3' overhang
(GACNNNN^NNGTC). Sequences adjacent to a DrdI site may be PCR amplified
using the 2 degenerate bases in the overhang to define a representation, and an
adjacent more common site (such as MspI). The degenerate 2 base 3' overhang
allows for both biochemical selection and bubble PCR to assure that only the DrdI
island amplifies (and not the more abundant MspI - MspI fragments). Using DrdI,
there are three levels of specificity for creating a unique representation: (i) ligation of
the top strand, (ii) ligation of the bottom strand linker, and (iii) extension of the
sequencing primer with polymerase. In addition, if there are too many DrdI sites in a
given BAC clone, or there is a need to obtain singlet sequence information, one may
obtain additional specificity by designing primers which reach in one or two
additional bases on the 3' side of the ligation junction (i.e. GACNNNN^NNGTC
(SEQ. ID. No. 33)), since the central degenerate bases are determined by the
specificity of the ligation reaction (i.e. GACNNNN^NNGTC (SEQ. ID. No. 33)).
Further, the conserved GTC on the 3' side of the cut site allows all sequences in a set
to be easily compared in the correct alignment. Finally, the degenerate 2 base
overhang allows one to obtain sequence information from either one, or the other, or
both sides of the DrdI site.
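
As a minimal sketch of how DrdI sites might be located and sorted into sets
computationally, the following scans one strand of a sequence for GACNNNNNNGTC
and reports the central two degenerate bases, which form the 2 base 3' overhang. The
function name and the example sequence are made up for illustration.

```python
import re

# DrdI recognition site GACNNNN^NNGTC; a lookahead lets overlapping sites match.
DRDI = re.compile(r"(?=GAC([ACGT]{6})GTC)")
PALINDROMIC = {"AT", "TA", "GC", "CG"}

def find_drdi_sites(seq):
    """Return (position, overhang, is_palindromic) for each DrdI site in seq."""
    sites = []
    for m in DRDI.finditer(seq.upper()):
        n6 = m.group(1)
        overhang = n6[2:4]          # central two bases of the N6 stretch
        sites.append((m.start(), overhang, overhang in PALINDROMIC))
    return sites

example = "TTGACAAAACTGTCAAGACTTATGGGTCCC"   # hypothetical test sequence
for pos, overhang, pal in find_drdi_sites(example):
    print(pos, overhang, "palindromic" if pal else "non-palindromic")
```

Grouping the returned sites by overhang gives the per-overhang sets discussed
above; sites with palindromic overhangs can be set aside.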

However, there may be a need to consider other restriction
endonuclease sites, for example, when starting with a library made from a BAC
vector with too many DrdI sites.

The use of split palindromic enzymes which generate a 3 base 3'
overhang, such as BglI (GCCNNNN^NGGC (SEQ. ID. No. 21)), and of type IIs
enzymes, like SapI (GCTCTTCN1/4), which generates a 3 base 5' overhang, is
described above.

A seemingly simple solution to obtaining sequence information is to
use a symmetric palindromic enzyme, such as BamHI, which cuts the BAC at several
places. Figure 25 is a schematic drawing showing the sequencing of BamHI islands
in random BAC clones in accordance with the present invention. This procedure is
largely the same as was described previously for DrdI, BglI, and SapI islands with
respect to Figures 1, 5, 16, and 17. After linker ligation, some of the fragments will
be under 4 kb and, thus, will amplify in a PCR reaction. The idea here is to amplify
all the fragments in a single tube and, then, obtain a representation through use of
carefully designed sequencing primers. The selectivity in this type of representation
is achieved by using a sequencing primer, whose last two bases extend beyond the
BamHI site (i.e. G^GATCCNN). It would be difficult to achieve a specificity of 3
bases beyond the site. In the example of the 170 kb BAC containing the Met
Oncogene, there was considerable clustering of the sites which were close enough to
amplify effectively. The results of using BamHI as the restriction enzyme are shown
in Figure 26.

It is also difficult to find an enzyme which cleaves the DNA frequently
enough that some fragments are under 4 kb, but not so frequently that too many
fragments amplify, as when using EcoRI or HindIII. Enzymes which are less
frequent due to a TAG stop codon in one of the potential reading frames (AvrII,
C^CTAGG; NheI, G^CTAGC; and SpeI, A^CTAGT) also have problems with
clustering. The results of using these enzymes as the restriction enzyme in
accordance with the present invention are shown in Figure 27.

Other symmetric palindromic enzymes which may be used are: KpnI,
SphI, AatII, AgeI, XmaI, NgoMI, BspEI, MluI, SacII, BsiWI, PstI, and ApaLI.

To overcome the above clustering problem, one could use an enzyme
which cuts more frequently due to a degeneracy, but then use linkers with only one of
the 2 or 4 possible degeneracies such that only a few fragments amplify. For
example, AccI has 4 different recognition sequences (GT^MKAC = GT^ATAC,
GT^AGAC, GT^CTAC, and GT^CGAC), and BsiHKAI also has 4 different
recognition sequences (GWGCW^C = GAGCA^C, GAGCT^C, GTGCA^C, and
GTGCT^C). Again, the selectivity in this type of representation is achieved by using
a sequencing primer, whose last two bases extend beyond the BsiHKAI site (i.e.
GAGCA^CNN). The advantage of these types of restriction sites is that a non-
palindromic overhang may be used for the linker. In simulations of these sites on the
171 kb BAC, only a few fragments amplify, including some which would provide too
few bases of sequence information to be meaningful (i.e. 19-44 bp). Figure 28 is a
schematic drawing showing the sequencing of BsiHKAI islands in random BAC
clones in accordance with the present invention. This procedure is largely the same as
was described previously for DrdI, BglI, and SapI islands with respect to Figures 1, 5,
16, and 17. The results of using BsiHKAI and AccI as the restriction enzymes are
shown in Figure 29.
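
The way a degenerate recognition sequence expands into its specific sequences can
be sketched as follows, using the standard IUPAC ambiguity codes (M = A or C,
K = G or T, W = A or T). The function name is hypothetical; cut positions are
omitted for simplicity.

```python
from itertools import product

# IUPAC ambiguity codes used in the recognition sequences above.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "M": "AC", "K": "GT", "W": "AT", "R": "AG", "Y": "CT", "N": "ACGT"}

def expand_site(site):
    """List every concrete sequence matched by a degenerate recognition site."""
    return ["".join(p) for p in product(*(IUPAC[b] for b in site))]

print(expand_site("GTMKAC"))   # AccI
print(expand_site("GWGCWC"))   # BsiHKAI
```

Choosing a linker for just one of the expanded sequences corresponds to selecting
one of the 4 possible degeneracies described above.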

An alternative is to use an infrequent restriction endonuclease site with
a middle base degeneracy in combination with a more frequent cutter, analogous to
use of DrdI as described earlier. By using a primer for only one of the degenerate
sequences, one can obtain sequence information from either one or the other side of
the site, such as by using SanDI (GG^GWCCC). Here, however, all the fragments are
amplified simultaneously in the initial PCR, and selectivity is achieved by using a
sequencing primer, whose last two bases extend beyond the recognition site
(GG^GWCCCNN). Another site, SexAI (A^CCWGGT), may also work; however,
the 5 base overhang may be large enough to allow substantial misligations of primer
to overhangs containing a mismatch. In simulations on the 171 kb BAC, all SanDI
and SexAI sites were singlet or doublet reads. Figure 30 is a schematic drawing
showing the sequencing of SanDI islands in random BAC clones in accordance with
the present invention. This procedure is largely the same as was described previously
for DrdI, BglI, and SapI islands with respect to Figures 1, 5, 16, and 17. The results
of using SanDI and SexAI as restriction enzymes are shown in Figure 31.

RsrII (CG^GWCCG) is an enzyme which provides the same overhang,
but is found less frequently than SanDI. For cases where a higher frequency site is
required, the enzymes PpuMI (RG^GWCCY), AvaII (G^GWCC), EcoO109
(RG^GNCCY), or Bsu36I (CC^TNAGG) may be used.

Presence of DrdI or other sites in BAC or plasmid vectors.

One important technical note is that the most common BAC vector,
pBeloBAC11, contains 4 DrdI sites, 4 BglI sites, 2 SapI sites, 6 AccI sites, 8 BsiHKAI
sites, 1 SpeI site, 1 BamHI site, and 1 SexAI site. See Figures 23 and 32-34.

As discussed above, there are three basic approaches to circumvent the
problem of the cloning vector having its own restriction sites: (i) remove the
restriction sites from the vector before constructing the library, (ii) destroy the vector
restriction sites in clones from a given library, or (iii) ignore the vector restriction
sites and use more selective sequencing primers. For the sites described above, the
AccI, BsiHKAI, SpeI, and BamHI sites do not require additional modification of the
pBeloBAC11 vector, because the amplification strategy with these sites needs two
neighboring sites of the correct sequence to create a PCR fragment. In addition,
pBeloBAC11 does not contain any AvrII, NheI, or SanDI sites.

Distribution of representative DrdI and SanDI sites in the genome.

A number of advanced BLAST searches of the current dbest and dbsts
databases were performed to determine if there are any unanticipated biases in the
distribution of DrdI sites and, in a smaller survey, of SanDI sites.
Distribution of representative DrdI sites in the genome.

1. Query: GACAAAANNGTC (SEQ. ID. No. 34)

Expect 100
Filter: None
Other Advanced Options: M=1 N=-4 S=12 S2=12

Non-redundant DBEST Division 1,814,938 sequences; 685,416,569 total letters.
DBSTS Division 59,288 sequences; 21,143,395 total letters.

Query: 1 GACAAAAAAGTC 12   dbest  51   dbsts   3
Query: 1 GACAAAAACGTC 12   dbest  20   dbsts  (0)
Query: 1 GACAAAAAGGTC 12   dbest  28   dbsts   1
Query: 1 GACAAAAATGTC 12   dbest  77   dbsts   4

Query: 1 GACAAAACAGTC 12   dbest  86   dbsts  (0)
Query: 1 GACAAAACCGTC 12   dbest   5   dbsts  (0)
Query: 1 GACAAAACGGTC 12   dbest   4   dbsts  (0)
Query: 1 GACAAAACTGTC 12   dbest  96   dbsts   3

Query: 1 GACAAAAGAGTC 12   dbest  62   dbsts   1
Query: 1 GACAAAAGCGTC 12   dbest   6   dbsts  (0)
Query: 1 GACAAAAGGGTC 12   dbest  20   dbsts   4
Query: 1 GACAAAAGTGTC 12   dbest  89   dbsts   1

Query: 1 GACAAAATAGTC 12   dbest   9   dbsts   4
Query: 1 GACAAAATCGTC 12   dbest   4   dbsts   1
Query: 1 GACAAAATGGTC 12   dbest  29   dbsts  (0)
Query: 1 GACAAAATTGTC 12   dbest  45   dbsts   2

Total                      dbest 633   dbsts  24
2. Query: GACAAACNNGTC (SEQ. ID. No. 35)

Expect 100
Filter: None
Other Advanced Options: M=1 N=-4 S=12 S2=12

Non-redundant DBEST Division 1,814,938 sequences; 685,416,569 total letters.
DBSTS Division 59,288 sequences; 21,143,395 total letters.

Query: 1 GACAAACAAGTC 12   dbest  49   dbsts   2
Query: 1 GACAAACACGTC 12   dbest  47   dbsts   2
Query: 1 GACAAACAGGTC 12   dbest  20   dbsts   5
Query: 1 GACAAACATGTC 12   dbest  22   dbsts   5

Query: 1 GACAAACCAGTC 12   dbest  29   dbsts   1
Query: 1 GACAAACCCGTC 12   dbest  14   dbsts   1
Query: 1 GACAAACCGGTC 12   dbest   3   dbsts  (0)
Query: 1 GACAAACCTGTC 12   dbest  17   dbsts   3

Query: 1 GACAAACGAGTC 12   dbest  21   dbsts  (0)
Query: 1 GACAAACGCGTC 12   dbest  15   dbsts   1
Query: 1 GACAAACGGGTC 12   dbest   8   dbsts  (0)
Query: 1 GACAAACGTGTC 12   dbest  33   dbsts   7

Query: 1 GACAAACTAGTC 12   dbest  15   dbsts   1
Query: 1 GACAAACTCGTC 12   dbest   8   dbsts  (0)
Query: 1 GACAAACTGGTC 12   dbest  40   dbsts   2
Query: 1 GACAAACTTGTC 12   dbest  59   dbsts   2

Total                      dbest 400   dbsts  32
3. Query: GACAAAGNNGTC (SEQ. ID. No. 36)

Expect 100
Filter: None
Other Advanced Options: M=1 N=-4 S=12 S2=12

Non-redundant DBEST Division 1,814,938 sequences; 685,416,569 total letters.
DBSTS Division 59,288 sequences; 21,143,395 total letters.

Query: 1 GACAAAGAAGTC 12   dbest  43   dbsts   0
Query: 1 GACAAAGACGTC 12   dbest   6   dbsts   1
Query: 1 GACAAAGAGGTC 12   dbest  62   dbsts   2
Query: 1 GACAAAGATGTC 12   dbest  29   dbsts   5

Query: 1 GACAAAGCAGTC 12   dbest  31   dbsts   3
Query: 1 GACAAAGCCGTC 12   dbest  49   dbsts  (0)
Query: 1 GACAAAGCGGTC 12   dbest   5   dbsts  (0)
Query: 1 GACAAAGCTGTC 12   dbest   5   dbsts   1

Query: 1 GACAAAGGAGTC 12   dbest  15   dbsts   1
Query: 1 GACAAAGGCGTC 12   dbest   8   dbsts   1
Query: 1 GACAAAGGGGTC 12   dbest  36   dbsts  (0)
Query: 1 GACAAAGGTGTC 12   dbest  14   dbsts  (0)

Query: 1 GACAAAGTAGTC 12   dbest   7   dbsts  (0)
Query: 1 GACAAAGTCGTC 12   dbest  21   dbsts  (0)
Query: 1 GACAAAGTGGTC 12   dbest  94   dbsts   4
Query: 1 GACAAAGTTGTC 12   dbest  21   dbsts  (0)

Total                      dbest 446   dbsts  18



4. Query: TCTGGGACCCNN (SEQ. ID. No. 37)

Expect 100
Filter: None
Other Advanced Options: M=1 N=-4 S=12 S2=12

Database: Non-redundant Database of GenBank STS Division
59,293 sequences; 21,148,385 total letters.

                           dbsts

Query: 1 TCTGGGACCCAA 12     3
Query: 1 TCTGGGACCCAC 12     1
Query: 1 TCTGGGACCCAG 12     7
Query: 1 TCTGGGACCCAT 12     2

Query: 1 TCTGGGACCCCA 12     6
Query: 1 TCTGGGACCCCC 12     6
Query: 1 TCTGGGACCCCG 12     1
Query: 1 TCTGGGACCCCT 12     5

Query: 1 TCTGGGACCCGA 12    (0)
Query: 1 TCTGGGACCCGC 12     1
Query: 1 TCTGGGACCCGG 12     3
Query: 1 TCTGGGACCCGT 12    (0)

Query: 1 TCTGGGACCCTA 12     2
Query: 1 TCTGGGACCCTC 12     8
Query: 1 TCTGGGACCCTG 12     3
Query: 1 TCTGGGACCCTT 12     5

Total                       53



The advanced BLAST search requires a minimum of 12 bases to look
for an exact match. In the initial stages of doing this search, the database computer
went down (probably unrelated); however, as a precaution, responses for a particular
sequence search were limited to 100. Since the dbest database contains about 1/4
nonhuman sequence, such sequences were removed in tallying the total for that
search. Thus, any number between 75 and 100 most probably reflects a lower value
for that particular DrdI site. Since many dbest searches returned fewer than 100
hits, it is unlikely that a particular total is grossly under-represented.
Nevertheless, to be accurate, the following values should be viewed as lower
estimates.

For the DrdI site, there are 6 non-palindromic two base 3' overhangs to
consider: AA, AC, AG, CA, GA, and GG. Searches were performed on a
representation of AA, AC, and AG sequences. The first two bases in the middle N6
degenerate sequence were arbitrarily chosen as "AA", the next two bases were AA,
AC, or AG, and the last two bases were entered 16 times for each of the NN
possibilities.
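
The construction of the 16 queries for each search can be sketched as follows. The
function names are hypothetical. A reverse complement helper is included because a
nucleotide search can match either strand, so each query also stands for its
reverse complement.

```python
from itertools import product

def drdi_queries(fixed_n4):
    """Build the 16 twelve-base queries GAC + fixed_n4 + NN + GTC."""
    return ["GAC" + fixed_n4 + "".join(nn) + "GTC"
            for nn in product("ACGT", repeat=2)]

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

queries = drdi_queries("AAAA")   # first two bases AA, overhang bases AA
print(len(queries), queries[0], queries[-1])
print(reverse_complement(queries[0]))
```

Passing "AAAC" or "AAAG" instead of "AAAA" gives the query sets for the AC and AG
overhangs.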

For all three searches (i.e., GACAAAANNGTC (SEQ. ID. No. 34),
GACAAACNNGTC (SEQ. ID. No. 35), and GACAAAGNNGTC (SEQ. ID.
No. 36)), sequences containing a CG dinucleotide in either database or a "TAG"
trinucleotide in the dbest database were, as expected, underrepresented. The STS
database is too small to draw major conclusions; however, the totals on all three
searches were within 2-fold of each other.

For the STS database of less than 21,000,000, 18-32 hits of human 
sequence were obtained which correlates to 1 site in 700,000 - 1,100,000 bases. 

For the dbest database of less than 685,000,000, 400 - 633 hits of
human sequence were obtained which correlates to 1 site in 1,100,000 to 1,700,000
bases.

Again, the middle N6 has 4096 different sequences. Because of the
palindromic nature of the site, whenever GACAAAAAAGTC (SEQ. ID. No. 38) was
searched, the program automatically also searched GACTTTTTTGTC (SEQ. ID.
No. 39), and each middle AA sequence was searched with 16 different flanking
dinucleotides. The number of sequences with a middle AA or TT is 4096/8 = 512;
dividing by the 16 sequences searched gives a scaling factor of 32.
For the dbest results, 400, 446, and 633 sequences in 685,000,000 is
equivalent to 1,752, 1,953, and 2,772 sequences, respectively, in 3,000,000,000. It
should be a little more, because the 685,000,000 contains approximately 1/4 sequence
which is non-human DNA.
So the total numbers of DrdI sites with AC, AG, and AA overhangs are
32 x 1,752; 1,953; and 2,772 = 56,064; 62,496; and 88,704 sites, respectively. Since
A-T bases are somewhat more frequent in the genome than G-C bases, the above
numbers are a slight over-representation. This occurs because they are based on
numbers obtained using "AA" as the arbitrarily chosen invariant first two bases in the
DrdI internal sequence. For the other 3 middle 2 base overhangs, "CA" is predicted
to be as frequent as "AG", i.e. about 60,000 sites; "GA" (whose complement is "TC")
is predicted to be as frequent as "AC", i.e. about 55,000 sites; and "GG" (whose
complement is "CC") is predicted to be less frequent than "AC", i.e. about 45,000
sites.
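
The extrapolation from database hits to genome-wide site counts is simple
proportional scaling and can be reproduced with the short sketch below. The
constant and function names are illustrative; the rounded database and genome sizes
are those used in the text.

```python
GENOME_BP = 3_000_000_000
DBEST_BP = 685_000_000
SCALE = 32   # 512 sequences with a given middle dinucleotide / 16 searched

hits = {"AC": 400, "AG": 446, "AA": 633}

def genome_estimate(n_hits, db_bp=DBEST_BP, genome_bp=GENOME_BP, scale=SCALE):
    """Scale database hits to the genome, then to all sites with that overhang."""
    per_genome = round(n_hits * genome_bp / db_bp)
    return per_genome, per_genome * scale

for overhang, n in hits.items():
    per_genome, total = genome_estimate(n)
    print(overhang, per_genome, total)
```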

The above calculations are consistent with the earlier prediction of
200,000 to 300,000 non-palindromic DrdI sites per genome, i.e. an average of 33,000
to 50,000 sites for each overhang.

Less detailed searches with SanDI were performed by arbitrarily
choosing the first 3 bases of a 12 base sequence as "TCT" and using the GGGACCC
site with the last two bases being entered 16 times for each of the NN possibilities.

For the STS database of less than 21,000,000, 53 hits of human
sequence were obtained which equals 1 site in 400,000 bases. 53 in 21,000,000 is
equivalent to 7,571 in 3,000,000,000. Since there are 64 different combinations for
the first 3 bases, that gives a prediction of 484,571 SanDI sites in the genome. These
may be divided into 16 sets, with an average of 30,000 sites per set.

The database searches demonstrate that the distribution of DrdI sites (as
well as SanDI and other selected sites) allows for the creation of from 5 to 16 sets
based on specific 2 base overhangs or neighboring 2 bases, where each set has from
about 30,000 to about 90,000 members, and may be used to create entire genome
overlapping contig maps.




Option 1: 1,800,000 short sequencing reactions generate approximately 100,000-
150,000 DrdI islands to create an entire BAC contig.

Figure 2 provides a scheme for sequencing representations of BAC
clones. Two approaches may be considered for preparing DNA. One rapid approach
is to pick individual colonies into lysis buffer and lyse cells under conditions which
fragment chromosomal DNA but leave BAC DNA intact. Chromosomal DNA is
digested by the ATP dependent DNase from Epicentre, which leaves CCC and OC
BAC DNA intact. After heat treatment to inactivate the DNase, restriction digestion,
ligation of linker adapters, and PCR amplification are all performed in a single tube.
The products are then aliquoted and sequencing is performed using specific primers to
the adapters. This first approach has the advantage of obviating the need to grow and
store 300,000 BAC clones.

An alternative approach is to pick the colonies into 1.2 ml growth
media and make a replica into fresh media for storage before pelleting and preparing
crude BAC DNA from a given liquid culture, similar to what is described above. This
second approach has the advantage of producing more BAC DNA, such that loss of an
island from PCR dropout is less likely. Further, this approach keeps a biological
record of all the BACs, which may become useful in the future for techniques such as
exon trapping, transfection into cells, or methods as yet undeveloped.

Figure 5 is an expanded version of Figure 2 detailing the subtleties of
the linker-adapter ligations and bubble PCR amplification to select only the DrdI-
MspI fragments. Figure 7 describes the three levels of specificity in using the DrdI
island approach.

With an average BAC size of 100-150 kb, a total of 20,000 to 30,000
BAC clones would cover the human genome, or 300,000 clones would provide at
least 10-fold coverage. For each clone, one requires 6 sequencing runs for a total of
1,800,000 sequencing reactions. However, only 80 bp of sequence is required to
deconvolute singlet/doublet information. At a conservative estimate of 1 run per hour
of 96 reactions, with 24 loadings/day, this equals 2,304 sequencing reads/PE 3700
machine/day. Assume access to 200 machines.
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1,800,000/2,304 sequencing reads/machine/day = 885 machines days/200 machines = 
4.4 days 
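The throughput estimates used for each option follow one pattern: total reads divided by reads per machine per day gives machine days, which are then divided by the number of machines. The following is a minimal sketch of that arithmetic, not part of the described method; the function and parameter names are hypothetical, and the rates for longer reads (1,240 and 760 reads/day) are simply the figures quoted in this description.

```python
def reads_per_machine_day(reactions_per_run=96, runs_per_day=24):
    """Reads one machine produces per day at one run per hour."""
    return reactions_per_run * runs_per_day


def sequencing_days(total_reads, reads_per_day, machines=200):
    """Return (machine_days, calendar_days) for a sequencing campaign."""
    machine_days = total_reads / reads_per_day
    return machine_days, machine_days / machines


if __name__ == "__main__":
    short_read_rate = reads_per_machine_day()  # 96 x 24 short reads per day
    rates = [
        ("about 80 bp", short_read_rate),
        ("200-300 bp", 1240),  # rate quoted in the text
        ("500-600 bp", 760),   # rate quoted in the text
    ]
    for label, rate in rates:
        machine_days, days = sequencing_days(1_800_000, rate)
        print(f"{label}: {machine_days:,.0f} machine days, {days:.1f} days")
```

With the stated 2,304 reads per machine per day, 1,800,000 reads comes to roughly 781 machine days, whereas a figure of 885 machine days corresponds to a rate nearer 2,034 reads per day, so the sketch is best read as a check on the arithmetic rather than a restatement of it.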

The above would provide about 80 bp anchored sequence information
for about 100,000 to 150,000 DrdI sites, spaced on average every 20-30 kb.

If the machine is run to read 200-300 bp, this equals 1,240 reads/day,
then:

1,800,000/1,240 sequencing reads/machine/day = 1,452 machine days/200 machines
= 7.3 days

The above would provide about 200-300 bp anchored sequence
information for about 100,000 to 150,000 DrdI sites, spaced on average every 20-30
kb.

If the machine is run to read 500-600 bp, this equals 760 reads/day,
then:

1,800,000/760 sequencing reads/machine/day = 2,368 machine days/200 machines =
11.8 days

The above would provide about 500-600 bp anchored sequence
information for about 100,000 to 150,000 DrdI sites, spaced on average every 20-30
kb.

Experiments will be needed to assess the quality of reads and the ability to
deconvolute the sequence when reading out 80, 200, or 500 bp. In simulations, it was
noted that doublets often contained one smaller and one larger fragment. Thus, useful
information may be obtained from a long read, where the first 200 bases are
predominantly from the shorter fragment (reading as a strong singlet sequence with a
weak doublet behind it), and when that fragment ends, the weaker sequence from the
larger fragment will be easy to read and interpret (See Figure 35). This may require
the algorithm to include alignment of fragments starting at a later position; however,
this should not be too difficult.

Option 2: 3,600,000 short sequencing reactions generate approximately 150,000-
200,000 DrdI islands to create an entire BAC contig.

Should pilot studies suggest that some sequence reads are difficult to
interpret, two sets of DrdI islands can be generated for each BAC clone, one set
consisting of AA, AC, AG, CA, GA, or GG overhangs, while the other set consists of
TT, GT, CT, TG, TC, or CC overhangs. Although most sequences would be
represented in both sets, each would rescue DrdI islands lost from the other set due to
either the neighboring TaqI or MspI site being too close (resulting in amplification of
a very short fragment which lacks the number of bases required to determine
uniqueness) or too far (resulting in weak or no amplification of the longer fragment).
In such a circumstance, the number of sequencing runs would double, but the number
of useable sequences for alignments would also increase. For the example of the Met
oncogene containing BAC on 7q31, the first six linker set provides 3 singlet and 3
doublet sequences. The second six linker set provides an additional 2 singlet and 3
doublet sequences (See Figure 35). Using this very conservative approach, 3,600,000
sequencing runs would be required:

3,600,000/2,304 sequencing reads/machine/day = 1,770 machine days/200 machines
= 8.8 days

The above would provide about 80 bp of anchored sequence
information for about 150,000 to 200,000 DrdI sites, spaced on average every 15-20
kb.
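The relationship between the two linker sets of Option 2 can be illustrated by enumerating dinucleotide overhangs. The sketch below is illustrative only and assumes the first set is the six overhangs AA, AC, AG, CA, GA, and GG (the Group II dinucleotides named later in this description); the second set is then derived as their reverse complements, and the four self-complementary dinucleotides are the palindromic overhangs excluded from both sets. All names in the code are hypothetical.

```python
from itertools import product

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


# All 16 possible two-base overhangs.
dinucleotides = ["".join(p) for p in product("ACGT", repeat=2)]

# Palindromic overhangs equal their own reverse complement and would let
# fragments or linkers ligate to themselves, so they are set aside.
palindromic = [d for d in dinucleotides if d == reverse_complement(d)]

# Assumed first linker set; the second set is its reverse complement.
first_set = ["AA", "AC", "AG", "CA", "GA", "GG"]
second_set = [reverse_complement(d) for d in first_set]

if __name__ == "__main__":
    print("palindromic:", palindromic)
    print("first set:  ", first_set)
    print("second set: ", second_set)
```

Together the two sets and the four palindromic dinucleotides account for all 16 possible two-base overhangs.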

If the machine is run to read 200-300 bp, this equals 1,240 reads/day,
then:

3,600,000/1,240 sequencing reads/machine/day = 2,904 machine days/200 machines
= 14.6 days

The above would provide about 200-300 bp anchored sequence
information for about 150,000 to 200,000 DrdI sites, spaced on average every
15-20 kb.

If the machine is run to read 500-600 bp, this equals 760 reads/day,
then:

3,600,000/760 sequencing reads/machine/day = 4,736 machine days/200 machines =
23.6 days

Add to this sequencing both ends of the 300,000 BAC clones (using
unique primers to the two ends and bubble PCR) = 600,000/760 sequencing
reads/machine/day = 790 machine days/200 machines = 3.9 days

The above would provide about 500-600 bp anchored sequence
information for about 150,000 to 200,000 DrdI sites, spaced on average every 15-20
kb. This is approximately 75 million to 120 million anchored bases and is from a
2.5% to 4% representation of the genome. With a 10-fold coverage, and reasonably
clean reads, one should be able to identify about 100,000 to 170,000 anchored SNPs
in 23.6 days. Further, the ends of the BAC clones will provide sequencing reads of
average length 325 bases for about 75% of the ends, for an additional 145 million
bases. The BAC end sequences are not completely anchored, since one cannot
determine the orientation of the ends with respect to other BAC clones unless the end
sequence fortuitously overlaps with another end sequence in the opposite orientation
(predicted to occur in 325/150,000 bp = 0.2% of the clones). Nevertheless, the BAC
end sequences are relatively anchored and will provide confirming sequence
information once the random sequences from 10 kb insert clones are collected. The
total of 28 days of sequencing will provide 7.5 to 9% of anchored and relatively
anchored genomic sequence.
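The anchored-base and genome-representation figures can be reproduced with a few multiplications. The sketch below is a rough check only; it assumes a haploid genome of 3,000 million bases (implied by the percentages above but not stated here), and the function names are hypothetical.

```python
GENOME_BP = 3_000_000_000  # assumed genome size; not stated in this passage


def anchored_bases(sites, read_length):
    """Total anchored bases from one read of read_length at each site."""
    return sites * read_length


def bac_end_bases(clones=300_000, ends_per_clone=2, success=0.75, read_length=325):
    """Bases from BAC end reads, given the fraction of ends that read well."""
    return clones * ends_per_clone * success * read_length


def genome_fraction(bases, genome_bp=GENOME_BP):
    """Fraction of the genome represented by a number of bases."""
    return bases / genome_bp


if __name__ == "__main__":
    low = anchored_bases(150_000, 500)
    high = anchored_bases(200_000, 600)
    ends = bac_end_bases()
    print(f"anchored: {low:,} to {high:,} bases")
    print(f"representation: {genome_fraction(low):.1%} to {genome_fraction(high):.1%}")
    print(f"BAC ends: {ends:,.0f} bases")
    print(f"combined: {genome_fraction(low + ends):.1%} to {genome_fraction(high + ends):.1%}")
```

Under these assumptions the combined range comes out near, though not exactly at, the 7.5 to 9% quoted above, which is consistent with the rounding used in the text.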

Alternatively, one can create DrdI libraries of 5 pooled individuals'
DNA in pUC vectors to find the SNPs. As described previously, a size-selection of
fragments between 200 and 1,000 bp will provide a 0.26% representation of the
genome (average size of 580 bp; number of fragments is 19,700) for a single
overhang. If the latter number is multiplied by 12 different overhangs, a 10-fold
coverage is provided, and both strands are sequenced, 20,000 x 12 x 10 = 2,400,000
sequencing runs are obtained.

2,400,000/760 sequencing reads/machine/day = 3,158 machine days/200 machines =
15.8 days

Thus, if the initial reads from the BAC libraries are exceptionally
clean, then long reads of 500-600 bp may be used to create an anchored representation
with 100,000 to 170,000 SNPs, and can be completed in 23.6 + 3.9 = 27.5 days.
Alternatively, much shorter runs may be used for the initial BAC sequencing, and
then higher quality runs may be used to extend the anchors and create a 200,000 SNP
library in 8.8 + 15.6 + 3.9 = 28.3 days.

Option 3: 2,400,000 short sequencing reactions generate approximately 150,000-
200,000 BglI islands to create an entire BAC contig.

One concept is to increase the number of anchored sites in a given
BAC. The BglI restriction endonuclease generates a 3 base 3' overhang, but may also
be used to create a representation (See Figure 14). Since the overhang is an odd
number of bases, it is not necessary to exclude the palindromic two base sequences
AT, TA, GC, and CG. To reduce the number of ligations from 64 (all the different
possible 3 base overhangs) to 16, the linkers and primers are degenerate at the last
position, i.e. end with a 3' ACN or AAN. (Please note: Greater specificity may be
achieved by using the degeneracy at the 3' end of the linker adapter.) Since there are
3 levels of specificity in the ligation and sequencing step (see Figure 36), the third
base degeneracy will not interfere with the fidelity of the reaction.
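The counting argument for the 3 base overhangs can be checked by enumeration. The sketch below is an illustration under simplifying assumptions: each linker class is modelled as its first two bases followed by N, exactly as the classes are written in the text, and how the degenerate position maps onto the physical linker strands is not modelled. The helper names are hypothetical. It shows that no 3 base overhang is its own reverse complement, that the 64 overhangs collapse to 16 classes, and how one might test that a chosen set of classes contains no class together with its complement.

```python
from itertools import product

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def overhang_class(overhang):
    """Collapse a 3 base overhang to a class degenerate at the last base."""
    return overhang[:2] + "N"


def complement_classes(cls):
    """Classes reached by reverse-complementing every member of a class."""
    return {overhang_class(reverse_complement(cls[:2] + n)) for n in "ACGT"}


def is_complement_free(classes):
    """True if no chosen class is the complement of another chosen class."""
    chosen = set(classes)
    return all(chosen.isdisjoint(complement_classes(c)) for c in chosen)


trinucleotides = ["".join(p) for p in product("ACGT", repeat=3)]
# An odd-length sequence cannot be self-complementary: the middle base
# would have to pair with itself.
self_complementary = [t for t in trinucleotides if t == reverse_complement(t)]
all_classes = sorted({overhang_class(t) for t in trinucleotides})

if __name__ == "__main__":
    print(len(trinucleotides), "overhangs,", len(all_classes), "classes")
    print("self-complementary overhangs:", self_complementary)
    print(is_complement_free(["AAN", "CAN", "GAN", "TAN", "AGN", "CGN", "GGN", "TGN"]))
```

In this model a class and its complements always differ in the middle base (A pairs with T, G with C), which is why a set restricted to two non-complementary middle bases stays free of complementary pairs.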

Again, with an average BAC size of 100-150 kb, a total of 20,000 to
30,000 BAC clones would cover the human genome, or 300,000 clones would
provide at least 10-fold coverage. For each clone, one requires 8 sequencing runs for
a total of 2,400,000 sequencing reactions. Using the same assumptions as above:

2,400,000/2,304 sequencing reads/machine/day = 1,042 machine days/200 machines
= 5.2 days

The above would provide about 80 bp anchored sequence information
for about 150,000 to 200,000 BglI sites, spaced on average every 15-20 kb.

If the machine is run to read 200-300 bp, this equals 1,240 reads/day,
then:

2,400,000/1,240 sequencing reads/machine/day = 1,935 machine days/200 machines
= 9.7 days

The above would provide about 200-300 bp anchored sequence
information for about 150,000 to 200,000 BglI sites, spaced on average every 15-20
kb.

If the machine is run to read 500-600 bp, this equals 760 reads/day,
then:

2,400,000/760 sequencing reads/machine/day = 3,158 machine days/200 machines =
15.8 days

The above would provide about 500-600 bp anchored sequence
information for about 150,000 to 200,000 BglI sites, spaced on average every 15-20
kb.



Option 4: 4,800,000 short sequencing reactions generate approximately 200,000-
250,000 BglI islands to create an entire BAC contig.

Should pilot studies suggest that some sequence reads are difficult to
interpret, two sets of BglI islands can be generated for each BAC clone, one set
consisting of AAN, CAN, GAN, TAN, AGN, CGN, GGN, or TGN overhangs, while
the other set consists of ACN, CCN, GCN, TCN, ATN, CTN, GTN, or TTN
overhangs. While most sequences would be represented in both sets, each would
rescue BglI islands lost from the other set due to either the neighboring TaqI or MspI
site being too close (resulting in amplification of a very short fragment which lacks
the number of bases required to determine uniqueness) or too far (resulting in weak or
no amplification of the longer fragment). In such a circumstance, the number of
sequencing runs would double, but the number of useable sequences for alignments
would also increase. For the example of the Met oncogene containing BAC on 7q31,
the first eight linker set provides 5 singlet and 3 doublet sequences. The second eight
linker set provides an additional 3 doublet sequences (See Figure 35). The set of non-
palindromic linker adapters may be mixed, as long as the complement is not also
included in the mixture. For example, to choose sites which will allow the PCR
primers to end in only a C or A, the set of AAN, CAN, GAN, TAN, ACN, CCN, GCN,
and TCN overhangs may be used (See Figure 35). This set allows design of PCR
primers with 3' bases of either "A" or "C", which tend to give less mispriming than
primers with 3' "G" or "T", which may give false PCR amplification products
resulting from polymerase extension of a T:G mismatched base. In this BAC, the TCT
or AGA overhang appeared too frequently, suggesting it may be associated with a
repetitive element. For the purposes of these calculations, the complete set of 16
linkers would require 4,800,000 sequencing runs, although fewer linkers would most
probably suffice:

4,800,000/2,304 sequencing reads/machine/day = 2,083 machine days/200 machines
= 10.4 days

The above would provide about 80 bp anchored sequence information
for about 200,000 to 250,000 BglI sites, spaced on average every 12-15 kb.

If the machine is run to read 200-300 bp, this equals 1,240 reads/day,
then:

4,800,000/1,240 sequencing reads/machine/day = 3,871 machine days/200 machines
= 19.4 days

The above would provide about 200-300 bp anchored sequence
information for about 200,000 to 250,000 BglI sites, spaced on average every 12-15
kb.

If the machine is run to read 500-600 bp, this equals 760 reads/day, 

then: 






4,800,000/760 sequencing reads/machine/day = 6,316 machine days/200 machines =
31.6 days

The above would provide about 500-600 bp anchored sequence
information for about 200,000 to 250,000 BglI sites, spaced on average every 12-15
kb.

Add to this sequencing both ends of the 300,000 BAC clones (using
unique primers to the two ends and bubble PCR) = 600,000/760 sequencing
reads/machine/day = 790 machine days/200 machines = 3.9 days

The above would provide about 500-600 bp anchored sequence
information for about 200,000 to 250,000 BglI sites, spaced on average every 12-15
kb. This is approximately 100 million to 150 million anchored bases and is from a 3%
to 5% representation of the genome. With a 10-fold coverage, and reasonably clean
reads, one should be able to identify about 130,000 to 200,000 anchored SNPs in 31.6
days. Further, the ends of the BAC clones will provide an additional 145 million
bases of relatively anchored sequences. The total of 36 days of sequencing will
provide 8 to 10% of anchored and relatively anchored genomic sequence.

As described above, one can create BglI libraries of 5 pooled
individuals' DNA in pUC vectors to find the SNPs. A size-selection of fragments
between 200 and 1,000 bp will provide a 0.26% representation of the genome for a
single overhang (about 20,000 fragments). If the latter number is multiplied by 16
different overhangs, a 10-fold coverage is provided, and both strands are sequenced,
there are 20,000 x 16 x 10 = 3,200,000 sequencing runs.

3,200,000/760 sequencing reads/machine/day = 4,210 machine days/200 machines =
21.0 days

Thus, if the initial reads from the BAC libraries are exceptionally
clean, then long reads of 500-600 bp may be used to create an anchored representation
with 130,000 to 200,000 SNPs, and can be completed in 31.6 + 3.9 = 35.5 days.
Alternatively, much shorter runs may be used for the initial BAC sequencing, and
then higher quality runs may be used to extend the anchors and create a 250,000 SNP
library in 10.4 + 21.0 + 3.9 = 35.3 days.

Option 5: 4,200,000 short sequencing reactions generate approximately 250,000-
300,000 DrdI and BglI islands to create an entire BAC contig.

An alternative strategy is to combine the best of both representations,
using 6 non-palindromic linker-adapters for the DrdI overhangs, and 8 non-
palindromic linker-adapters for the BglI overhangs (see Figure 37). If the multiplex
PCR of 14 different linker-adapter sets does not amplify all fragments in sufficient
yield, then the BAC DNA may be aliquoted initially into two or more tubes. Further,
unique primer sets may be used to increase the yield of a PCR fragment prior to the
sequencing reaction. The advantage of such a hybrid representation is that it
maximizes the distribution of independent sequence elements. As noted above,
should any DrdI or BglI site be frequently found in repetitive elements, that overhang
can be removed from the representation. For the full representation, the hybrid
approach uses 6 + 8 = 14 sequencing runs for each BAC:

4,200,000/2,304 sequencing reads/machine/day = 1,823 machine days/200 machines
= 9.1 days

The above would provide about 80 bp anchored sequence information
for about 250,000 to 350,000 DrdI and BglI sites, spaced on average every 8-12 kb.

If the machine is run to read 200-300 bp, this equals 1,240 reads/day,
then:

4,200,000/1,240 sequencing reads/machine/day = 3,387 machine days/200 machines
= 16.9 days

The above would provide about 200-300 bp anchored sequence
information for about 250,000 to 350,000 DrdI and BglI sites, spaced on average
every 8-12 kb.

If the machine is run to read 500-600 bp, this equals 760 reads/day,
then:

4,200,000/760 sequencing reads/machine/day = 5,526 machine days/200 machines =
27.6 days

The above would provide about 500-600 bp anchored sequence
information for about 250,000 to 350,000 DrdI and BglI sites, spaced on average
every 8-12 kb. This is approximately 125 million to 210 million anchored bases and
is from a 4.2% to 7% representation of the genome. With a 10-fold coverage, and
reasonably clean reads, one should be able to identify about 180,000 to 300,000
anchored SNPs in 31.6 days. Further, the ends of the BAC clones will provide an
additional 145 million bases of relatively anchored sequences. The total of 32 days
of sequencing will provide 9.2 to 12% of anchored and relatively anchored genomic
sequence.

As described above, one can create BglI libraries of 5 pooled
individuals' DNA in pUC vectors to find the SNPs. A size-selection of fragments
between 200 and 1,000 bp will provide a 0.26% representation of the genome for a
single overhang (about 20,000 fragments). If the latter number is multiplied by 16
different overhangs, a 10-fold coverage is provided, and both strands are sequenced,
20,000 x 14 x 10 = 2,800,000 sequencing runs are obtained.

2,800,000/760 sequencing reads/machine/day = 3,684 machine days/200 machines =
18.4 days

Thus, if the initial reads from the BAC libraries are exceptionally
clean, then long reads of 500-600 bp may be used to create an anchored representation
with 180,000 to 300,000 SNPs, and can be completed in 27.6 + 3.9 = 31.5 days.
Alternatively, much shorter runs may be used for the initial BAC sequencing, and
then higher quality runs may be used to extend the anchors and create a 240,000 SNP
library in 9.1 + 18.4 + 3.9 = 31.4 days. In summary, a month and a day of
sequencing on 200 machines will provide a valuable database containing anchored
and mapped sequence islands of 500-600 bases on average every 8-12 kb with
approximately 240,000 mapped SNPs.

IV. Creating a DrdI Island Database of Mapped SNPs and Using a Universal
DNA Array for High Throughput Detection of SNPs.

Use of the DrdI Island Approach for Alignment of Plural Clones

Figures 38 to 45 show how the DrdI island approach of the present
invention can be utilized to align 4 hypothetical BAC clones containing 8 to 12 non-
palindromic DrdI sites. In this example, the 6 linkers with the Group II dinucleotide
overhangs (i.e. AG, AC, CA, GA, AA, and GG) are used. The DrdI sites are labeled
1a, 1b, 1c, ... 2a, 2b, ... up to 6a, 6b, .... The numeral represents the type of non-
palindromic 2 base overhang for that DrdI site: 1 = AA, 2 = AC, 3 = AG, 4 = CA, 5
= GA, and 6 = GG. The lower-case letter represents the first = a, second = b, third =
c, and so on, for each unique sequence with that particular non-palindromic 2 base
overhang. As described more fully below, each of the 6 linkers generates a separate
representation of overlapping islands on the 4 different BAC clones. When the
different representations obtained with each linker in the DrdI island analysis are
combined, the alignment of the BAC clones can be determined.

In each of Figures 38-44, the top panel illustrates the actual position of
each DrdI site within each BAC, and the DrdI island data generated from each of these
BAC clones is provided in the table below. After obtaining sequence information in
each clone, one compares the sequences in each column and determines if the two
entries are concordant or discordant as described supra. The BAC clones overlap if
the entries in that column are concordant. The BAC clones do not overlap if all the
entries in all the columns are discordant. Since a large scale sequencing project will
produce from about 30,000 to 90,000 entries in each column, virtually all the clones
will be discordant with each other; only a few will overlap with each other at a given
point in the contig. The number of different ways to establish overlap between two
BAC clones is considerable.

In Figure 38, the DrdI island approach is used to determine sites with
AA overhangs. When the procedure described supra with respect to Figure 1 is
carried out, for AA overhangs, BAC clone I is found to have a triplet, BAC clone II
has a doublet, BAC clone III has a doublet, and BAC clone IV has a singlet. Based
on these results and dideoxy sequencing, the DrdI islands in these clones are found to
have 5 different sequences with AA overhangs (i.e. sequences 1a to 1e) at defined
positions in 1 or more of the 4 BAC clones, as shown in Figure 38. Based on this data
alone, concordances (i.e. an indication that 2 or more clones are contiguous) are found
between clones I and III (i.e. sequence 1b in the triplet in clone I and the doublet of
clone III), clones II and III (i.e. sequence 1e in the doublet in clone II and the doublet
of clone III), clones III and IV (i.e. sequence 1e in the doublet in clone III and the
singlet of clone IV), and clones II and IV (i.e. sequence 1e in the doublet in clone II
and the singlet of clone IV). On the other hand, discordances (i.e. an indication that 2
or more clones are not contiguous) are found between clones I and II (i.e. there is no
overlap between the 1a, 1b, and 1c sequences of clone I and the 1d and 1e sequences
of clone II) and clones I and IV (i.e. there is no overlap between the 1a, 1b, and 1c
sequences of clone I and the 1e sequence of clone IV). Based on the identification of
these concordances and discordances, a tentative alignment for some of clones I to IV
can be determined, as shown at the bottom of Figure 38.
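The concordance test for a single overhang amounts to asking whether two clones share at least one island sequence. The sketch below illustrates this with set intersections. The island assignments are an assumed reading of the Figure 38 description (in particular, clone II is taken to carry sequences 1d and 1e), the labels stand in for the actual island sequences that would be compared in practice, and all names are hypothetical.

```python
from itertools import combinations

# Assumed AA-overhang islands per BAC clone, read from the description of
# Figure 38: a triplet, two doublets, and a singlet.
fig38_aa_islands = {
    "I": {"1a", "1b", "1c"},
    "II": {"1d", "1e"},
    "III": {"1b", "1e"},
    "IV": {"1e"},
}


def compare_clones(islands):
    """Classify every clone pair as concordant (shared island) or discordant."""
    concordant = {}
    discordant = []
    for a, b in combinations(islands, 2):
        shared = islands[a] & islands[b]
        if shared:
            concordant[(a, b)] = shared
        else:
            discordant.append((a, b))
    return concordant, discordant


if __name__ == "__main__":
    concordant, discordant = compare_clones(fig38_aa_islands)
    for pair, shared in concordant.items():
        print("concordant", pair, sorted(shared))
    for pair in discordant:
        print("discordant", pair)
```

A discordance for one overhang does not by itself rule out overlap; as the text explains, clones are considered non-overlapping only when every column is discordant, so the same comparison would be repeated for each of the six overhangs and the results combined.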

Figure 39 shows how the DrdI island approach is used to determine the
sequences of sites with AC overhangs and, based upon this information, to tentatively
align the 4 hypothetical BAC clones. Using the analysis described above with respect
to Figure 38, but for the AC overhangs, 3 concordances and 2 discordances are
identified and the tentative alignment of the 4 hypothetical BAC clones is determined,
as shown in Figure 39. As noted above, the results of Figure 38 identified
concordance among BACs I through IV based on overlapping sequences. However,
as shown with respect to Figure 39, a concordance cannot be deduced between BAC I
and III, since there are no overlaps in the identified sequences.

Figure 40 shows how the DrdI island approach is used to determine the
sequences of sites with AG overhangs and, based upon this information, to tentatively
align the 4 hypothetical BAC clones. Using the analysis described above with respect
to Figure 38, but for the AG overhangs, 2 concordances and 2 discordances are
identified and the tentative alignment of the 4 hypothetical BAC clones is determined,
as shown in Figure 40. Overlap between BAC II & III, or BAC III & IV could not be
deduced using the AG overhang site alone.

Figure 41 shows how the DrdI island approach is used to determine the
sequences of sites with CA overhangs and, based upon this information, to tentatively
align the 4 hypothetical BAC clones. Using the analysis described above with respect
to Figure 38, but for the CA overhangs, 4 concordances and 2 discordances are
identified and the tentative alignment of the 4 hypothetical BAC clones is determined,
as shown in Figure 41.

Figure 42 shows how the DrdI island approach is used to determine the
sequences of sites with GA overhangs and, based upon this information, to tentatively
align the 4 hypothetical BAC clones. Using the analysis described above with respect
to Figure 38, but for the GA overhangs, 1 concordance and 2 discordances are
identified and the tentative alignment of only 2 of the 4 hypothetical BAC clones is
determined, as shown in Figure 42.

Figure 43 shows how the DrdI island approach is used to determine the
sequences of sites with GG overhangs and, based upon this information, to tentatively
align the 4 hypothetical BAC clones. Using the analysis described above with respect
to Figure 38, but for the GG overhangs, no concordances and 1 discordance are
identified and the tentative alignment of the 4 hypothetical BAC clones cannot be
determined, as shown in Figure 43. In Figure 43, there is a doublet in clone I based
on the presence of sequences 6a and 6b, a singlet based on the presence of sequence
6c, and a multiplet in clone III based on the presence of sequences 6a, 6b, 6c, and 6d.
In view of the multiplet in clone III, the sequence of the DrdI island GG overhangs
cannot be determined. However, a set of 4 sequencing primers can be used to extend
one base beyond the GG overhang (i.e. the 3' end of the primers contains GGA, GGC,
GGG, and GGT) to obtain additional information. It is not necessary to do
so in this case, because the data for the other overhangs shows that concordance exists
between clones I and III and between clones III and IV.

The analyses conducted in conjunction with Figures 38 to 43 can be
combined to obtain a listing of the sequences obtained for each of the dinucleotide
overhangs, a listing of the concordances, and a listing of the discordances, as shown
in Figure 44. Based on this information, the unique and overlapping DrdI islands in
the 4 hypothetical BAC clones can be identified and the clones themselves aligned, in
accordance with Figure 45. In this hypothetical, as illustrated, the order of the clones
is as follows: I, III, IV, and II. This result was determined on a very conservative
basis. For example, although sequence 6c is unique to clone IV, the multiplet of GG
sequences in clone III precludes an unambiguous assignment for the position of this
sequence. Also, the listing does not order the DrdI sites which are unique to a given
clone. Finally, one can arrange the information to achieve a contig of the map
position of the DrdI sites which correspond to the individual BAC clones. The DrdI
sites are grouped into 6 sets allowing a rough determination of the BAC clone
alignment. Certain sites remain unmapped, such as 6c, although one may surmise
that it probably overlaps with clone III, since clone II lacks a DrdI site with a GG
overhang. The precise order of DrdI sites within a grouping cannot be determined
from this data alone, but will be easily obtained from sequence information on smaller
cosmid clones, once the BAC contig is completed.

Examples of alignment of human DNA BAC contigs using DrdI islands

The simulations in the previous section demonstrate how the DrdI
alignment is achieved. BAC overlaps in the genome databases were rare. The
following are examples from 3 contigs on chromosome 7. Figure 46 shows
representational fragments which would be obtained with DrdI/MspI/TaqI digests.
Figure 47 shows representational fragments which would be obtained with DrdI/MseI
digests. The fragments which allow one to establish overlap have appropriate
symbols next to them to show that they are in more than one BAC.

For an example using DrdI/MspI/TaqI digests, contig 1941 contains 3
BACs. BAC RG253B13 overlaps with RG013N12 based on the DrdI/MspI/TaqI
fragments generated from DrdI AG (115 and 353 bp), AC (381 bp), GA (559 bp), GA
(3,419 bp; may not amplify) and AA (192 and 597 bp) overhangs. BAC RG013N12
overlaps with RG300G03 based on the DrdI/MspI/TaqI fragments generated from
DrdI AG (1,137 bp), GA (16 bp, may be too small), and AA (2,328 bp).

For example, using DrdI/MseI digests, contig T002144 contains 5
BACs. BAC RG022J17 overlaps with RG067E13 based on the DrdI/MseI fragments
generated from DrdI AG (338 bp), GA (17, 77, and 586 bp), AA (273 bp), and GG (55
bp) overhangs. BAC RG067E13 overlaps with RG011J21 based on the DrdI/MseI
fragments generated from DrdI AC (71 bp). BAC RG011J21 overlaps with
RG022C01 based on the DrdI/MseI fragments generated from DrdI AG (92 bp), AA
(48 bp), and GG (42 bp) overhangs. Note that establishing overlap between
RG022C01 and RG043K06 would require either using the other DrdI overhangs (in
this case TT) or, alternatively, having more BACs in the library.

900,000 short sequencing reactions will be needed to create an entire BAC contig
using the DrdI islands approach: completed in 39 days using 10 of the Perkin Elmer
3700 machines.

As described above, the DrdI island procedure is amenable to
automation and requires just a single extra reaction (simultaneous cleavage/ligation)
compared to dideoxy sequencing. Use of 6 additional primers is compatible with
microtiter plate format for delivery of reagents (6 at a time). Further, very short
sequences of only 80 to 100 bases are more than sufficient to determine concordance
or discordance with other entries in the database.

With an average BAC size of 100-150 kb, a total of 20,000 to 30,000
BAC clones would cover the human genome, or 150,000 clones would provide 5-fold
coverage. For each clone, one requires 6 sequencing runs for a total of 900,000
sequencing reactions. At a conservative estimate of 1 run per hour of 96 reactions,
with 24 loadings/day, this equals 2,304 sequencing reads/PE 3700 machine/day.

Thus, the DrdI approach for overlapping all BAC clones providing a 5-
fold coverage of the human genome would require only 39 days using 10 of the new
PE 3700 DNA sequencing machines.

The complete set of DrdI islands provided six sets to determine
overlap. The number of islands within a BAC can be increased by using a second
representation, such as BglI. Further, this example used only 4 hypothetical clones
with minimal coverage; in the actual human genome sequencing, there will be a 10-
fold coverage of the genome. The precise order of DrdI sites within a grouping
cannot be determined from this data alone, but will be easily obtained from sequence
information on smaller 10 kb plasmid clones, once the BAC contig is completed.

Completing the entire genome sequence based on the BAC DrdI and BglI islands.

The total unique sequence in the hybrid DrdI-BglI island database will
be approximately 125 million to 210 million anchored bases with an additional 145
million bases of relatively anchored sequences from the BAC ends. This will provide
9.2 to 12% of anchored and relatively anchored genomic sequence, i.e. approximately
1/10th of the entire genome will be ordered on the human genome. This is sufficient
density to allow for a shotgun sequencing of total genomic DNA from the ends of 10
kb clones. The shotgun cloning will require only a 5-fold coverage of the genome
since the islands are relatively dense. At an average of 1 kb of reads per clone (i.e. 2
sequencing reactions of 500 bp/clone), 3,000,000 clones would provide 1-fold
coverage and 15,000,000 clones would provide a 5-fold coverage. Since sequence
information will be obtained from both ends, the process will require almost 200 days.

30,000,000/760 sequencing reads/machine/day = 39,473 machine days/200 machines
= 197 days
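The shotgun estimate chains together clone count, reads per clone, and machine throughput. The following is a minimal sketch of that chain; it assumes a genome of 3,000 million bases (implied by 3,000,000 clones at 1 kb each giving 1-fold coverage), and the function and key names are hypothetical.

```python
def shotgun_plan(
    genome_bp=3_000_000_000,   # assumed genome size
    bases_per_clone=1_000,     # two 500 bp end reads per clone
    fold_coverage=5,
    reads_per_clone=2,
    reads_per_machine_day=760,
    machines=200,
):
    """Estimate clones, reads, and sequencing time for shotgun coverage."""
    clones_per_fold = genome_bp // bases_per_clone
    clones = clones_per_fold * fold_coverage
    reads = clones * reads_per_clone
    machine_days = reads / reads_per_machine_day
    return {
        "clones_per_fold": clones_per_fold,
        "clones": clones,
        "reads": reads,
        "machine_days": machine_days,
        "days": machine_days / machines,
    }


if __name__ == "__main__":
    plan = shotgun_plan()
    print(f"{plan['clones']:,} clones, {plan['reads']:,} reads")
    print(f"{plan['machine_days']:,.0f} machine days, {plan['days']:.0f} days")
```

Changing `fold_coverage` or `machines` shows how sensitive the roughly 200-day figure is to those two choices.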

On average, each 10 kb clone will immediately overlap with one of the
ordered island sequences in the above database (9.2 to 12% of the genome). Overlap is
determined using unique sequences near the ends of a given island. An algorithm is
designed to choose 32 unique bases on each side of the island which are not part of a
repetitive sequence. This 32 base sequence will be designated a "Velcro island".
Thus, for the 250,000 to 350,000 DrdI and BglI ordered islands in the database, there
will be between 500,000 and 700,000 "Velcro islands". As sequence information is
generated, it is queried in 32 base portions to see if it has either perfect 32/32 or almost
perfect 31/32 alignment with one of the Velcro sequences. If yes, then the
neighboring 20 bases on each side (if available) are also queried to determine if this is
a true overlap. When this overlap is achieved, it generates 3 new "Velcro islands"
and removes one of them from the database. One of the new Velcro islands is the
distal sequence on the 500 bases which overlap with the original DrdI island. The
other two new Velcro islands are the end portions of the 500 base sequence attached
to this particular clone, either approximately 10 kb upstream or downstream of the
DrdI island, depending on orientation. If any of the new Velcro regions is in a repeat
sequence, it is removed from the Velcro database. This reduces formation of false
contigs. These two new Velcro islands are immediately queried against all other DrdI
and BglI islands in the BAC contig region. In the example in Figures 42-43, islands
1e, 2c, and 4c all map to the same contig region. This type of analysis is repeated
with each new random plasmid sequence, thus initially creating more Velcro islands,
and subsequently creating fewer Velcro islands as the genomic sequence fills in. Each
genome equivalent will hit from 80% to 90% of the Velcro islands, expanding each
island by an average of 500 bases, plus a bridge of another 500 bases, or about 400 to 600
million bases. Thus, on a first pass, ordered information should increase from about
9%-12% to about 21%-32% of the genome. The remaining clones are rescanned into
the new Velcro database, which now has from 2 to 2.5-fold more islands, allowing
more connectivity points, which now add about 800 to 1,200 million bases, bringing
ordered information to about 47%-72% of the genome. With a third and fourth pass,
this approach leads to a complete sequence of the entire genome. The genome is
substantially filled in by the 5-fold coverage.
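
The Velcro island query described above (a 32 base window accepted at 32/32 or 31/32 identity) can be sketched as follows. This is an illustrative brute-force scan, not the algorithm actually used; the names `find_velcro_hits` and `velcro_islands` are hypothetical, and a production system would index the 500,000 to 700,000 Velcro sequences rather than compare against each one in turn. The confirmation step using the neighboring 20 bases is omitted.

```python
def hamming(a, b):
    """Number of mismatched positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def find_velcro_hits(read, velcro_islands, k=32, max_mismatches=1):
    """Return (read_offset, island_name) for every k-base window of `read`
    that matches a Velcro sequence with at most `max_mismatches` mismatches."""
    hits = []
    for offset in range(len(read) - k + 1):
        window = read[offset:offset + k]
        for name, seq in velcro_islands.items():
            if hamming(window, seq) <= max_mismatches:
                hits.append((offset, name))
    return hits

# Hypothetical 32-base Velcro sequence and a read carrying it with one mismatch.
velcro_islands = {"island_A": "ACGTTGCAAGCTTAGGCTAACCGGTTAAGCTA"}
read = "GG" + "CCGTTGCAAGCTTAGGCTAACCGGTTAAGCTA" + "TT"
hits = find_velcro_hits(read, velcro_islands)
print(hits)
```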

Construction of a finished genomic sequence over a 1 megabase region
was simulated using a random number generator which provided sequence read start
points for 5,000 "random" clones, with the assumption that each start point provided
500 bases of sequence. To each of these, another 500 bases of sequence was included
at a random distance of 8 to 12 kb downstream. The randomly generated sites were
sorted by position and queried for presence of sequencing gaps. This was based on
the conservative requirement for 40 bp overlap between two sequence runs. Thus,
sequence start points more than 460 bases apart were scored as gapped. Two types of
gaps need to be considered: (i) gaps in sequence information between the two 500
base reads generated from a random clone, which will be filled in as needed, and (ii) gaps
between two unrelated clones which are not bridged. In the 1 megabase region, there
were 74 small gaps which fell within a given clone. Of these, 50 gaps were
between 460 and 560 bases, i.e. less than 100 bases from the nearest anchored
sequence. Thus, extending the sequencing read from 500 to 600 bases would close
these 50 regions. The remaining 24 sites are less than 500 bp away from an anchored
site and can be filled in when the region in question is being closely scrutinized for
important genes.
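
A simulation of this kind can be sketched in a few lines of Python. The version below is a simplified assumption-laden reconstruction, not the simulation actually run: it counts every spacing above 460 bases as a gap and does not separate gaps within a clone from gaps between unbridged clones. The function name, the seed, and the handling of paired reads that fall past the end of the region are all illustrative choices.

```python
import random

def simulate_gaps(region_bp=1_000_000, n_clones=5_000, read_len=500,
                  min_overlap=40, seed=0):
    """Return (number of read start points, number of gaps,
    number of gaps that a 100-base longer read would close)."""
    rng = random.Random(seed)
    starts = []
    for _ in range(n_clones):
        first = rng.randrange(region_bp)
        # paired read from the other clone end, 8 to 12 kb downstream
        second = first + rng.randint(8_000, 12_000)
        starts.append(first)
        if second < region_bp:
            starts.append(second)
    starts.sort()
    max_spacing = read_len - min_overlap   # 460 bases
    gaps = [b - a for a, b in zip(starts, starts[1:]) if b - a > max_spacing]
    closable = sum(1 for g in gaps if g <= max_spacing + 100)
    return len(starts), len(gaps), closable

n_starts, n_gaps, n_closable = simulate_gaps()
print(n_starts, n_gaps, n_closable)
```

Because the start points are random, the gap counts vary with the seed; they should be compared with the figures in the text only in order of magnitude.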

The 1 megabase region also contained 26 gaps in between two
unrelated clones which were not bridged. Of these, 21 were between 460 and 560
bases, i.e. less than 100 bases from the nearest anchored sequence. Thus, extending
the sequencing read from 500 to 600 bases would close these 21 regions. The
remaining 6 sites need to be filled in using primer walking. Five of these sites were
within 500 bp, and the remaining site was within 1,000 bp; thus, each of these
regions can be closed using sequencing primers from both sides of the anchored
sequence. The same primers are used to PCR amplify the region from the genome
and then sequence it. On average, 12 sequencing/PCR primers will be required to
close 6 gaps per megabase. For the entire human genome at 3,000 megabases: 3,000
x 12 = 36,000 primers and sequencing runs. There are a number of commercial
vendors synthesizing primers, many of whom claim capacity of "1,000's of oligos
per day", so at a conservative estimate of 2,000 primers/day at $20/primer, the
synthesis run would require 18 days.

36,000 reads / 760 sequencing reads/machine/day = 47 machine-days
47 machine-days / 200 machines = 0.23 days


The grand total is:

Mapped DrdI and BglI islands with over 200,000 SNPs;
10-fold coverage of BACs w/ends                            =  31.5 days
Random 10 kb plasmid clones; 5-fold coverage of entire genome = 197 days
Closure of gaps using primer walking                       =  18.5 days

Total:                                                     = 247 days
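
The gap-closure and grand-total figures can be checked with a short calculation. The sketch below takes the per-stage day counts as stated in the text rather than rederiving them; the dictionary keys and variable names are illustrative only.

```python
# Primer walking: 12 primers per megabase over a 3,000 megabase genome.
primers = 3_000 * 12
synthesis_days = primers / 2_000            # at 2,000 primers/day
sequencing_days = primers / 760 / 200       # 760 reads/machine/day, 200 machines

# Stage durations exactly as stated in the text.
stages_days = {
    "mapped DrdI and BglI islands, 10-fold BAC coverage": 31.5,
    "random 10 kb plasmid clones, 5-fold genome coverage": 197.0,
    "closure of gaps using primer walking": 18.5,
}
total_days = sum(stages_days.values())

print(primers, synthesis_days, round(sequencing_days, 3), total_days)
```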

30 

BAC clone derived singlets are used to align plasmid DrdI islands to generate a
comprehensive DrdI SNP database.

The singlet sequences deduced from deconvoluting the BAC clone
contig database (see above) will be used to align more complete DrdI islands
generated by sequencing in both directions from cosmid or plasmid clones. About
200,000 to 300,000 DrdI islands are predicted in the human genome. The DrdI
islands are a representation of 1/15th to 1/10th of the genome.

As described above, 500,000 plasmid or cosmid clones of average size
30-40 kb will provide 5 to 6-fold coverage of the human genome. These plasmids and
cosmids will be generated from a mixture of 10 individuals' DNA to provide a rich
source of SNPs. Initially, only 6 primers will be used per plasmid/cosmid to identify
those DrdI sites present in the clone. A subsequent run will be performed with the
correct overhang linkers for generating the sequence of the opposite strand for those
DrdI sites present in that clone, as well as using more selective primers for obtaining
unique sequence information from doublet or triplet reads. An average of 3 sites per
clone will rapidly generate 1,500,000,000 bases of sequence information from the
DrdI sites, plus the 500,000,000 bases of unique sequence information from the ends
of the clones. The 1,500,000,000 bases of sequence information from the DrdI sites
will contain the same regions resequenced an average of 5-6 times, providing
250,000,000 to 300,000,000 bases of unique sequence and ample amounts of SNP
information. This comprehensive DrdI island approach will require on average 12
sequencing runs per clone to determine the unique singlet DrdI sequences, for a total
of 6,000,000 sequencing runs.
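
The totals in this paragraph follow from the clone count. The sketch below reproduces them under one assumption that the text does not state explicitly: each DrdI site yields 1,000 bases (500 bases read in each direction), and each clone end yields one 500 base read. Variable names are illustrative.

```python
clones = 500_000
sites_per_clone = 3
bases_per_site = 1_000      # assumption: 500 bases read in each direction
bases_per_end_read = 500    # assumption: one 500-base read per clone end

site_bases = clones * sites_per_clone * bases_per_site
end_bases = clones * 2 * bases_per_end_read

# The same regions are resequenced 5 to 6 times on average.
unique_low = site_bases // 6
unique_high = site_bases // 5

sequencing_runs = clones * 12
print(site_bases, end_bases, unique_low, unique_high, sequencing_runs)
```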

This comprehensive DrdI island approach will provide from 250,000
to 430,000 SNPs. It has been estimated that 30,000 to 300,000 SNPs will be needed
to map the positions of genes which influence the major multivariate diseases in
defined populations using association methods. Further, the above SNP database will
be connected to a closed BAC clone map of the entire genome. A more rapid
approach to finding SNPs is provided below.

A novel shotgun approach to generate a mapped DrdI SNP database, which is
amenable to high-throughput detection on a DNA array.

A procedure for PCR-amplifying the DrdI island directly from a BAC clone,
using a second frequent cutter enzyme to create small
fragments for amplification, was described above. The second enzyme (e.g. MspI) can
contain a two base 5' overhang such that ligation/cutting could proceed in a single
reaction tube. The ligation primers/PCR primers can be designed such that only
DrdI-second enzyme fragments amplify.

A detailed evaluation of 4 sequenced BAC clones from 7q31 shows
that ideally, the second enzyme should be a mixture of both TaqI and MspI.
TaqI is known to retain some activity at 37°C, and, thus, the entire
reaction containing DNA, adapter linkers, DrdI, TaqI, MspI, and T4 ligase may be
carried out in a homogeneous reaction at 37°C. Further, TaqI becomes irreversibly
denatured at 75°C. Therefore, a heat step prior to the PCR reaction is sufficient to
inactivate all the enzymes.

A close analysis of the length of fragments generated in a DrdI, TaqI,
and MspI cleavage/ligation/amplification reveals that not every DrdI site is amplified
(on the assumption that fragments above 4 kb will not amplify well in a mixture
containing much smaller amplicons). Further, in a competition where one fragment
is small (e.g. 200 bp) compared to a much larger fragment (e.g. 2,000 bp), the smaller
one will generate more PCR product, which may be sufficient to swamp out the
sequencing ladder in the first 200 bases. Ironically, this only aids in the analysis of
the sequence information, because comparison of singlet with singlet reads is the
easiest to interpret.

In one BAC clone, RG364P16, the DrdI sites are positioned such that
the AA, AC, AG, CA, GA, and GG overhangs used in the linker would generate only
3 fragments below about 4,000 bp. Actually, the first site would generate an
additional product to a TaqI or MspI site within the BAC vector. See Figure 48.
Even three sites are sufficient to determine clone overlap. Nevertheless, if needed,
linkers containing the complement TT, GT, CT, TG, TC, and CC overhangs would
provide additional sequences at some of the other DrdI sites.
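
The second set of overhangs is the reverse complement of the first, which can be confirmed with a few lines of Python. This is an illustrative check only; the helper name is not from the disclosure.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse complement of a DNA string written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]

overhangs = ["AA", "AC", "AG", "CA", "GA", "GG"]
complement_overhangs = [reverse_complement(o) for o in overhangs]
print(complement_overhangs)
```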

For creating the representation required for shotgun cloning, 1 µg of
pooled genomic DNA (200 ng each from 5 individuals = 10 chromosome equivalents)
= 150,000 copies of the genome = 0.25 attomoles of genomes or 0.5 attomoles of each
gene is used. This procedure is shown in Figure 49 and is largely the same as that
described with reference to Figure 5, except that after PCR amplification, the PCR product
is cut with XmaI and XhoI enzymes. The resulting digested product is separated on a
gel. The fragments of 200 to 1000 bp are cloned into the corresponding sites of a
vector. The inserts can be sequenced to build a mapped SNP database. This
procedure is described in more detail below.
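
The conversion from genome copies to attomoles uses Avogadro's number. A minimal sketch, with an illustrative function name:

```python
AVOGADRO = 6.022e23  # molecules per mole

def copies_to_attomoles(copies):
    """Convert a number of molecules to attomoles (1 attomole = 1e-18 mole)."""
    return copies / AVOGADRO * 1e18

genomes = copies_to_attomoles(150_000)   # diploid genome copies
genes = copies_to_attomoles(300_000)     # two copies of each gene per genome
print(round(genomes, 2), round(genes, 2))
```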

The pooled DNA is cut with DrdI, TaqI, and MspI, in the presence of
phosphorylated DrdI adapters containing a unique 2 base 3' overhang (i.e. AA) as
well as a methylated XmaI site (C^mCCGGG) in the adapter sequence, in the
presence of unphosphorylated TaqI and MspI adapters containing 2 base 5' CG
overhangs as well as a methylated XhoI site (CTCG^mAG) in the adapter sequence,
and in the presence of T4 ligase, such that the linkers are added to their respective
overhangs in a homogeneous reaction at 37°C. The adapters are methylated so they
are not cut by TaqI and MspI during this reaction. Enzymes are inactivated by heating
at 85°C to 98°C, preferably 95°C, for 2 to 20 minutes, preferably for 5 minutes.

Alternatively, the MspI/TaqI adapter is phosphorylated, contains a
blocking group on the 3' end of the top strand, and contains a bubble to prevent
amplification of unwanted MspI-MspI, TaqI-MspI, or TaqI-TaqI fragments. While
the linker can ligate to itself in the phosphorylated state, these linker dimers will not
amplify. Phosphorylation of the linker and use of a blocking group eliminates the
potential artifactual amplification of unwanted MspI-MspI, TaqI-MspI, or TaqI-TaqI
fragments. T4 ligase attaches the DrdI and MspI/TaqI linkers to their respective sites
on the human genome fragments, with biochemical selection assuring that most sites
contain linkers (See Figure 49A). The adapters are methylated so they are not cut by
TaqI and MspI during this reaction.

Unmethylated PCR primers are now added in excess of the adapters
and used for PCR amplification of the appropriate fragments. Of the approximately
50,000 DrdI sites, approximately 70% will give fragments under 4 kb (based on the
computer simulation of DrdI sites on 4 BAC clones, where 27/38 non-palindromic
DrdI sites had TaqI or MspI sites within 4 kb). Thus, about 35,500 fragments will be
amplified. Again, from the simulations, fragments totaling 24.8 kb are
amplified from 550 kb of BAC clone DNA, which is 4.5% of the genome; given that
only 1/6th of those fragments are amplified in a unique overhang representation, this
is a 0.75% representation of the genome. However, for size-selected fragments of
between 200 and 1,000 bp, only 15/38 fragments, representing a total of 8.7 kb, are
amplified from 550 kb of BAC DNA, and 1/6th of this is a 0.26% representation
of the genome (average size of 580 bp; number of fragments is 19,700).
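
The representation estimates above can be reproduced from the BAC simulation counts. The sketch below uses only the numbers given in the text; the variable names are illustrative.

```python
drdi_sites = 50_000          # DrdI sites per unique overhang, as stated
bac_kb = 550.0               # total BAC sequence examined

# Fragments under 4 kb: 27 of 38 non-palindromic DrdI sites.
fragments_under_4kb = drdi_sites * 27 / 38
fraction_all = 24.8 / bac_kb           # all six overhangs together
fraction_one = fraction_all / 6        # a single overhang representation

# Size-selected fragments of 200 to 1,000 bp: 15 of 38 sites, 8.7 kb total.
fragments_selected = drdi_sites * 15 / 38
fraction_selected = 8.7 / bac_kb / 6
mean_size_bp = 8_700 / 15

print(round(fragments_under_4kb), round(100 * fraction_one, 2),
      round(fragments_selected), round(100 * fraction_selected, 2),
      mean_size_bp)
```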

A limited PCR amplification of 11-12 cycles (assuming 90%
efficiency per cycle) will give a good representation and produce about 2 µg of final
mixed fragments product in the 200-1,000 bp range, without a major distortion or
bias of the representation. The mixed fragments are separated on an agarose gel (e.g.
low melting agarose from Seakem), the correct size fragment region excised, purified
by standard means, and then cleaved with XmaI (heteroschizomer of SmaI) and XhoI
and inserted into the corresponding sites in a standard vector, such as pUC18. The
library will contain multiple copies of the approximately 19,700 fragments in the
representation. The above procedure can be modified such that the library will contain
more or fewer fragments in the representation. For example, a size-selection between
200 and 2,000 bp will slightly increase the library to approximately 25,000 fragments
in the representation. For making larger libraries, more than one linker for the DrdI
site overhang may be used, e.g. both AA and AC overhangs would double the library
to approximately 40,000 fragments in the representation. All the non-palindromic
overhangs which are non-complementary (i.e. AA, AC, AG, CA, GA, GG) may be
used to make an even larger library of approximately 120,000 fragments in the
representation. For making smaller libraries, a PCR primer with one or two additional
selective bases on the 3' end is used during the PCR amplification step. For example,
use of a DrdI site linker with an AA overhang and a PCR primer with an AAC 3' end
overhang would reduce the library to approximately 5,000 fragments in the
representation. The ideal size of the library will depend on the sequencing capacity of
the facility (See Table 7). Other restriction endonucleases with degenerate overhangs
as the primary enzyme may be used to create the representational library, such as
BglI, DraIII, AlwNI, PflMI, AccI, BsiHKAI, SanDI, SexAI, PpuI, AvaII, EcoO109,
Bsu36I, BsrDI, BsgI, BpmI, SapI, or an isoschizomer of one of the aforementioned
enzymes. Palindromic restriction endonucleases may also be used to create the
representational library, such as BamHI, AvrII, NheI, SpeI, XbaI, KpnI, SphI, AatII,
AgeI, XmaI, NgoMI, BspEI, MluI, SacII, PstI, ApaLI, or an isoschizomer of
one of the aforementioned enzymes.

Table 7. Shotgun cloning of DrdI representation.

DrdI           Frequency     Fragment      # Amplified    # SNPs in     Fraction of
Type           in Genome     Size (kbp)    Sequences      Sequences     Genome

AAC              12,500      0.2-1             5,000         4,100       0.07%
AAC, AAA         25,000      0.2-1             9,850         8,200       0.13%
AA               50,000      0.2-1            19,700        16,400       0.26%
AA, AC          100,000      0.2-1            39,400        32,800       0.52%
6 overhangs     300,000      0.2-1           118,200        98,400       1.56%


When using shotgun cloning to amplify genomic DrdI representations
for SNP discovery, it is critical that the amplification procedure does not introduce
false SNPs from polymerase errors during amplification. The use of proofreading
polymerases such as Pfu polymerase should minimize such errors. When creating
representational libraries with primer selectivity using a proofreading polymerase, use
of probes with 3' thiophosphate linkages is preferred to avoid removal of selective
bases from the primer.

An alternative approach to minimize false SNPs is to pre-select the
representational fragments, and/or avoid amplification altogether. This may be
achieved by using biotinylated linker/adapters to a specific DrdI overhang, followed
by purification of only those fragments using streptavidin beads. Such primer
sequences are listed in Table 8.

Table 8. DrdI and MspI/TaqI bubble linkers and PCR primers for
representational shotgun cloning.

Primer            Sequence (5'->3')

DAA1              5' Biotin-C18 spacer-GAA TAC CCG GGA TGA CTA CGT GTA A 3' (SEQ. ID. No. 40)
DAA2R             5' pA CAC GTA GTC ATC CCG GGT ATT C 3' (SEQ. ID. No. 41)
DAAP3             5' GAA TAC CCG GGA TGA CTA CGT GTsAsA 3' (SEQ. ID. No. 42)

DAC5              5' Biotin-C18 spacer-GAT ACC CGG GAT GAG TAC GAC A 3' (SEQ. ID. No. 43)
DAC6R             5' pT GTC GTA CTC ATC CCG GGT ATC 3' (SEQ. ID. No. 44)
DACP7             5' GAT ACC CGG GAT GAG TAC GAC AsAsC 3' (SEQ. ID. No. 45)

DAG9              5' Biotin-C18 spacer-GAT ACC CGG GAT GAG TAC GTC AAG 3' (SEQ. ID. No. 46)
DAG10R            5' pT GAC GTA CTC ATC CCG GGT ATC 3' (SEQ. ID. No. 47)
DAGP11            5' GAT ACC CGG GAT GAG TAC GTC AsAsG 3' (SEQ. ID. No. 48)

DCA13             5' Biotin-C18 spacer-GAT TAC CCG GGA TGA CTA CGT ATC A 3' (SEQ. ID. No. 49)
DCAGAGG141822R    5' pA TAC GTA GTC ATC CCG GGT AAT C 3' (SEQ. ID. No. 50)
DCAP15            5' GAT TAC CCG GGA TGA CTA CGT ATsCsA 3' (SEQ. ID. No. 51)

DGA17             5' Biotin-C18 spacer-GAT TAC CCG GGA TGA CTA CGT ATG A 3' (SEQ. ID. No. 52)
DGA19             5' GAT TAC CCG GGA TGA CTA CGT ATsGsA 3' (SEQ. ID. No. 53)

DGG21             5' Biotin-C18 spacer-GAT TAC CCG GGA TGA CTA CGT ATG G 3' (SEQ. ID. No. 54)
DGGP23            5' GAT TAC CCG GGT AGA CTA CGT ATsGsG 3' (SEQ. ID. No. 55)

MTCG225           5' GAC ACG TCA CGT CTC GAG TCC TA 3' (SEQ. ID. No. 56)
MTCGp326R         5' pCGT AGG ACT CAC AAC GTG ACG T-Bk (SEQ. ID. No. 57)
MTCG0326R         5' CGT AGG ACT CAC AAC GTG ACG T-Bk (SEQ. ID. No. 58)
MTCG227           5' GAC ACG TCA CGT CTC GAG TCC TsAsC 3' (SEQ. ID. No. 59)
MTCG228           5' GAC ACG TCA CGT CTC GAG TCC TAC 3' (SEQ. ID. No. 60)


Using sufficient starting DNA, the representations may be generated by ligating on
biotinylated linkers, removing unreacted linkers, for example, by ultrafiltration on an
Amicon YM30 or YM50 filter, and, then, binding only those representational
fragments containing the ligated biotinylated linker to streptavidin magnetic beads.
After a 30 min. incubation with constant agitation, the captured fragments are purified
by magnetic separation, and, then, the complementary strand is melted off the
biotinylated strand at 95°C for 30 seconds and rapidly recovered. The single-stranded
DNA is converted to double stranded DNA (without methyl groups) using a few (2-5)
rounds of PCR with a proofreading polymerase such as Pfu polymerase.
Alternatively, non-methylated linkers (listed in Table 9) containing a small mismatch
on the biotinylated strand may be used, followed by the above steps of ligation,
capture, and purification.

Table 9. New DrdI linkers/primers for representational shotgun cloning (no
amplification).

Primer                   Sequence (5'->3')

DAA101 (New)             5' Biotin-C18 spacer-GAA TAC AAG GGA TGA CTA CGT GTA A 3' (SEQ. ID. No. 61)
DAA102R (New)            5' pA CAC GTA GTC ATC CCG GGT ATT C 3' (SEQ. ID. No. 62)
DAAP3                    5' GAA TAC CCG GGA TGA CTA CGT GTsAsA 3' (SEQ. ID. No. 63)

DAC105 (New)             5' Biotin-C18 spacer-GAT ACA AGG GAT GAG TAC GAC 3' (SEQ. ID. No. 64)
DAC106R (New)            5' pT GTC GTA CTC ATC CCG GGT ATC 3' (SEQ. ID. No. 65)
DACP7                    5' GAT ACC CGG GAT GAG TAC GAC AsAsC 3' (SEQ. ID. No. 66)

DAG109 (New)             5' Biotin-C18 spacer-GAT ACA AGG GAT GAG TAC GTC AAG 3' (SEQ. ID. No. 67)
DAG110R (New)            5' pT GAC GTA CTC ATC CCG GGT ATC 3' (SEQ. ID. No. 68)
DAGP11                   5' GAT ACC CGG GAT GAG TAC GTC AsAsG 3' (SEQ. ID. No. 69)

DCA113 (New)             5' Biotin-C18 spacer-GAT TAC AAG GGA TGA CTA CGT ATC A 3' (SEQ. ID. No. 70)
DCAGAGG141822R2 (New)    5' pA TAC GTA GTC ATC CCG GGT AAT C 3' (SEQ. ID. No. 71)
DCAP15                   5' GAT TAC CCG GGA TGA CTA CGT ATsCsA 3' (SEQ. ID. No. 72)

DGA117 (New)             5' Biotin-C18 spacer-GAT TAC AAG GGA TGA CTA CGT ATG A 3' (SEQ. ID. No. 73)
DGA19                    5' GAT TAC CCG GGA TGA CTA CGT ATsGsA 3' (SEQ. ID. No. 74)

DGG121 (New)             5' Biotin-C18 spacer-GAT TAC AAG GGA TGA CTA CGT ATG G 3' (SEQ. ID. No. 75)
DGGP23                   5' GAT TAC CCG GGT AGA CTA CGT ATsGsG 3' (SEQ. ID. No. 76)


The resultant single strands are subsequently converted to double strands by extension
of a perfectly matched, non-methylated primer using a proofreading polymerase such
as Pfu polymerase. This procedure avoids PCR amplification altogether, but requires
a large amount of starting genomic DNA.

With an average of one SNP every 700 bp, the 19,700 fragments will
contain about 16,400 SNPs. To find the most abundant SNPs, a 6-fold coverage of
these fragments should suffice. This would amount to 118,400 sequencing runs from
one direction and, for clones above 500 bp in length, an additional 50% (59,200 runs)
from the other side of the fragment, for a total of 177,600 sequencing runs.

For 500 bp reads, estimating 1 run per 2 hours of 96 reactions, with 12
loadings/day, this equals 1,152 sequencing reads/machine/day. Thus, the shotgun
cloning/sequencing of unique DrdI islands for finding mapped SNPs in a 6-fold
coverage of the human genome would require only 15.4 days using 10 of the new PE
3700 DNA sequencing machines.
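
The 15.4 day figure follows from the run counts and machine throughput given above. A short illustrative calculation, with hypothetical variable names:

```python
reads_per_machine_per_day = 96 * 12     # 96 reactions per run, 12 loadings/day

forward_runs = 118_400                  # 6-fold coverage, one direction
reverse_runs = forward_runs // 2        # additional 50% for clones above 500 bp
total_runs = forward_runs + reverse_runs

machines = 10
days = total_runs / reads_per_machine_per_day / machines
print(reads_per_machine_per_day, reverse_runs, total_runs, round(days, 1))
```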

Obtaining SNPs using the other 5 representations (AC, AG, CA,
GA, and GG) would take an additional 77 days, yielding a total of 98,500 SNPs. To
double this amount, one would evaluate SNPs using the complement overhangs (TT,
GT, CT, TG, TC, and CC), which would require a simultaneous mapping from the
original BAC library.

In summary, the entire human genome may be mapped using the DrdI
island approach, and, using the shotgun representation cloning approach, 197,000
mapped SNPs would be generated in just 88 days using 30 of the PE 3700 DNA
sequencing machines.

High-throughput detection of SNPs in a DrdI island representation on a DNA array.

A good PCR amplification, starting with 100 pmoles of each primer in
20 µl, generates about 3 µg of DNA total in about 40 cycles. For a 500 bp fragment, that
is about 9 picomoles total = about 0.5 picomoles/µl. However, when PCR amplifying
a mixture of fragments, one can generate a larger quantity of product, since product
reannealing is the limiting factor in a typical PCR reaction. A good representation
can generate 1-2 µg product per µl, or a conservative 20 µg product in a 20 µl
reaction. For a 500 bp fragment, that is about 60 picomoles total = about 3
picomoles/µl. To make a representation for the DNA array, the concept is to
selectively amplify a subset of the representation such that sufficient product is
formed, allowing for LDR discrimination of each SNP allele and addressable array
capture/detection.
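
The mass-to-mole conversions above can be checked as follows. The sketch assumes an average molar mass of 660 g/mol per base pair of double-stranded DNA, a commonly used approximation that the text does not state; the function name is illustrative.

```python
def dsdna_picomoles(mass_ug, length_bp, g_per_mol_per_bp=660.0):
    """Picomoles of a double-stranded fragment of length_bp in mass_ug of DNA.
    The 660 g/mol per base pair value is an assumed average."""
    moles = mass_ug * 1e-6 / (length_bp * g_per_mol_per_bp)
    return moles * 1e12

single = dsdna_picomoles(3, 500)      # single-fragment PCR, 3 ug total
mixture = dsdna_picomoles(20, 500)    # representation, 20 ug total
print(round(single, 1), round(single / 20, 2),
      round(mixture, 1), round(mixture / 20, 2))
```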

A procedure for making a representation of genomic DNA which will
amplify about 8,750 fragments, of which about 4,100 will contain mapped SNPs for
evaluation on a 4,096 address universal addressable array, is shown in Figure 49. Start
with 100 ng of human DNA = 15,000 copies = 0.025 attomoles of each allele. The
DNA is cut with DrdI, TaqI, and MspI, in the presence of phosphorylated DrdI
adapters containing a unique two base 3' overhang (i.e. AA) and unphosphorylated
TaqI and MspI adapters containing two base 5' overhangs, and in the presence of T4 ligase, such
that the linkers are added to their respective overhangs in a homogeneous reaction at
37°C (See Figure 50). Alternatively, the MspI/TaqI adapter is phosphorylated,
contains a blocking group on the 3' end of the top strand, and contains a bubble.
Phosphorylation of the linker and use of a blocking group eliminates the potential
artifactual amplification of unwanted MspI-MspI, TaqI-MspI, or TaqI-TaqI fragments.
T4 ligase attaches the DrdI and MspI/TaqI adapters to their respective sites on the
human genome fragments, with biochemical selection assuring that most sites contain
linkers (See Figure 50A). In carrying out this procedure, the initial steps are similar
to those shown in Figure 5, up to and including the PCR amplification phase which
occurs immediately prior to sequencing. However, in this procedure,
the representation is derived from the total genomic DNA of a biological sample, be it
from germline or tumor cells, not from a BAC clone. Further, the PCR primer may
have one or two additional base(s) on the 3' end to obtain a representation of the
correct number of fragments (about 8,750 in the example provided). In addition, after PCR
amplification, the amplification product is subjected to a ligase detection reaction
("LDR") procedure to detect single base changes, insertions, deletions, or
translocations in a target nucleotide sequence. The ligation product of the LDR
procedure is then captured on an addressable array by hybridization to capture probes
fixed to a solid support. This use of LDR in conjunction with the capture of a ligation
product on a solid support is more fully described in WO 97/31256 to Cornell
Research Foundation, Inc. and Gerry, N. et al., "Universal DNA Array with
Polymerase Chain Reaction/Ligase Detection Reaction (PCR/LDR) for Multiplex
Detection of Low Abundance Mutations," J. Mol. Biol. 292:251-262 (1999), which are
hereby incorporated by reference.
In brief, however, this procedure involves providing a plurality of
oligonucleotide probe sets. Each set is characterized by (a) a first oligonucleotide
probe, having a target-specific portion and an addressable array-specific portion, and
(b) a second oligonucleotide probe, having a target-specific portion and a detectable
reporter label. The oligonucleotide probes in a particular set are suitable for ligation
together when hybridized adjacent to one another on a corresponding target
nucleotide sequence, but have a mismatch which interferes with such ligation when
hybridized to any other nucleotide sequence present in the sample. The PCR
amplification product, described in Figure 50, the plurality of oligonucleotide probe
sets, and the ligase are blended to form a mixture which is subjected to one or more
ligase detection reaction cycles. The ligase detection reaction cycles include a
denaturation treatment, where any hybridized oligonucleotides are separated from the
target nucleotide sequences, and a hybridization treatment, where the oligonucleotide
probe sets hybridize at adjacent positions in a base-specific manner to their respective
target nucleotide sequences, if present in the sample, and ligate to one another to form
a ligated product sequence containing (a) the addressable array-specific portion,
(b) the target-specific portions connected together, and (c) the detectable reporter
label. The oligonucleotide probe sets may hybridize to nucleotide sequences in the
PCR amplification product other than their respective target nucleotide sequences but
do not ligate together due to the presence of one or more mismatches. As a result, the
nucleotide sequences and oligonucleotide probe sets individually separate during the
denaturation treatment.
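
The allele discrimination logic of the ligase detection reaction can be illustrated with a toy model. The sketch below is a deliberate simplification and not a model of ligase chemistry: probes are written in the same sense as the target strand they pair with (so no complementing is done), any mismatch in either probe is treated as blocking ligation, and the sequences, function name, and parameters are all hypothetical.

```python
def ldr_ligates(target, snp_index, discriminating_probe, common_probe):
    """Toy check: True when the discriminating probe ends exactly on the SNP
    base and the common probe begins on the next base, both perfectly matched."""
    up_start = snp_index - len(discriminating_probe) + 1
    if up_start < 0:
        return False
    upstream = target[up_start:snp_index + 1]
    down_end = snp_index + 1 + len(common_probe)
    downstream = target[snp_index + 1:down_end]
    return upstream == discriminating_probe and downstream == common_probe

# Two alleles differing at index 6 (G versus A).
allele_g = "ACGTACGGTTCA"
allele_a = "ACGTACAGTTCA"
probe_g, probe_a, common = "TACG", "TACA", "GTTC"

results = {
    ("G allele", "G probe"): ldr_ligates(allele_g, 6, probe_g, common),
    ("G allele", "A probe"): ldr_ligates(allele_g, 6, probe_a, common),
    ("A allele", "A probe"): ldr_ligates(allele_a, 6, probe_a, common),
    ("A allele", "G probe"): ldr_ligates(allele_a, 6, probe_g, common),
}
print(results)
```

In this model only the matched probe pair yields a ligation product, which corresponds to the product that would carry both the addressable array-specific portion and the reporter label.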

A support with different capture oligonucleotides immobilized at
particular sites is used in conjunction with this process. The capture oligonucleotides
have nucleotide sequences complementary to the addressable array-specific portions.
The mixture, after being subjected to the ligase detection reaction cycles, is contacted
with the support under conditions effective to hybridize the addressable array-specific
portions to the capture oligonucleotides in a base-specific manner. As a result, the
addressable array-specific portions are captured on the support at the site with the
complementary capture oligonucleotide. Reporter labels of the ligated product
sequences captured to the support at particular sites are detected. This permits the
presence of one or more target nucleotide sequences in the sample to be identified.
The ligase detection reaction process phase of the present invention is
preceded by the representational polymerase chain reaction process of the present
invention. The preferred thermostable ligase is that derived from Thermus aquaticus.
This enzyme can be isolated from that organism. M. Takahashi, et al., "Thermophilic
DNA Ligase," J. Biol. Chem. 259:10041-47 (1984), which is hereby incorporated by
reference. Alternatively, it can be prepared recombinantly. Procedures for such
isolation as well as the recombinant production of Thermus aquaticus ligase (as well as
Thermus thermophilus ligase) are disclosed in WO 90/17239 to Barany, et al., and F.
Barany, et al., "Cloning, Overexpression and Nucleotide Sequence of a Thermostable
DNA-Ligase Encoding Gene," Gene 109:1-11 (1991), which are hereby incorporated
by reference. These references contain complete sequence information for this ligase
as well as the encoding DNA. Other suitable ligases include E. coli ligase, T4 ligase,
Pyrococcus ligase, as well as those listed in Table 3.

The hybridization step, which is preferably a thermal hybridization
treatment, discriminates between nucleotide sequences based on a distinguishing
nucleotide at the ligation junctions. The difference between the target nucleotide
sequences can be, for example, a single nucleic acid base difference, a nucleic acid
deletion, a nucleic acid insertion, or rearrangement. Such sequence differences
involving more than one base can also be detected. Preferably, the oligonucleotide
probe sets have substantially the same length so that they hybridize to target
nucleotide sequences at substantially similar hybridization conditions.

The process of the present invention is able to detect nucleotide
sequences in the sample in an amount of 100 attomoles to 250 femtomoles.
Quantitative detection of the G12V mutation of the K-ras gene, from 100 attomoles to 30
femtomoles using two LDR probes in the presence of 10 micrograms of salmon sperm
DNA, is shown in Figure 51. By coupling the LDR step with a primary polymerase-
directed amplification step, the entire process of the present invention is able to detect
target nucleotide sequences in a sample containing as few as a single molecule.
Furthermore, PCR amplified products, which often are in the picomole amounts, may
easily be diluted within the above range. The ligase detection reaction achieves a rate
of formation of mismatched ligated product sequences which is less than 0.005 of the
rate of formation of matched ligated product sequences.
Once the ligation phase of the process is completed, the capture phase
is initiated. During the capture phase of the process, the mixture is contacted with the
support at a temperature of 45-90°C and for a time period of up to 60 minutes.
Hybridizations may be accelerated by adding volume exclusion agents, chaotropic agents, or
Mg2+. When an array consists of dozens to hundreds of addresses, it is important that
the correct ligation products have an opportunity to hybridize to the appropriate
address. This may be achieved by the thermal motion of oligonucleotides at the high
temperatures used, by mechanical movement of the fluid in contact with the array
surface, or by moving the oligonucleotides across the array by electric fields. After
hybridization, the array may be washed sequentially with a low stringency wash
buffer and then a high stringency wash buffer.

It is important to select capture oligonucleotides and addressable 
nucleotide sequences which will hybridize in a stable fashion. This requires that the 
oligonucleotide sets and the capture oligonucleotides be configured so that the 
oligonucleotide sets hybridize to the target nucleotide sequences at a temperature less 

20 than that which the capture oligonucleotides hybridize to the addressable array- 
specific portions. Unless the oligonucleotides are designed in this fashion, false 
positive signals may result due to capture of adjacent unreacted oligonucleotides from 
the same oligonucleotide set which are hybridized to the target. 

Several approaches have been tested to produce universal addressable 

25 arrays. One hundred different 2- and 3-dimensional matrices were tested; the current 
formulation uses an acrylamide/acrylic acid copolymer containing low levels of bis- 
acrylamide crosslinker. The polymer surfaces were prepared by polymerizing the 
monomer solution on glass microscope slides pretreated with a silane containing an 
acryl moiety. Amino-modified address oligonucleotides containing a hexaethylene 

30 oxide spacer were hand-spotted onto NHS pre-activated slides and coupled for 1 hour 
at 65°C in a humidified chamber. Following coupling, the polymer was soaked in a 
high salt buffer for 30 minutes at 65°C to remove all uncoupled oligonucleotides. 
Both activated and arrayed surfaces can be stored under dry conditions for several 
months with no decrease in activity. 

Hybridization conditions were varied with respect to temperature, time, 
5 buffer, pH, organic solvents, metal cofactors, volume exclusion agents, and mixing 
conditions, using test fluorescently-labeled zip-code complementary probes. Under a 
variety of conditions, no cross-hybridization was observed between even closely 
related addresses, with signal-to-noise of at least 50:1. Different addresses hybridize 
at approximately the same rate yielding approximately the same quantity of 

10 fluorescent signal when normalized for oligonucleotide coupled per address. The 
probes diagrammed in Figure 52 were synthesized and tested in a multiplex 
PCR/LDR reaction on cell line DNA containing known K-ras mutations. Each array 
identified the mutation correctly with signal-to-noise of at least 20:1 (Figure 53). 
Further, this demonstrates the ability of the universal array to detect two single- 

15 nucleotide polymorphisms simultaneously: the wild-type and mutant sequence are 
present in all panels except from normal cells or from the cell line containing only the 
G12V mutant DNA. 

The detection phase of the process involves scanning and identifying if 
ligation of particular oligonucleotide sets occurred and correlating ligation to a 

20 presence or absence of the target nucleotide sequence in the test sample. Scanning 
can be carried out by scanning electron microscopy, confocal microscopy, 
charge-coupled device, scanning tunneling electron microscopy, infrared microscopy, 
atomic force microscopy, electrical conductance, and fluorescent or phosphor 
imaging. Correlating is carried out with a computer. 

25 To determine DNA array capture sensitivity, mixtures of an excess of 

unlabeled to labeled probe were tested. This simulates an LDR reaction where an 
excess of unligated probes compete with the labeled LDR products for hybridization 
to the array. DNA arrays were hybridized in quadruplicate with from 100 amoles to 
30 fmol FamCZip13 (synthetic 70-mer LDR product) mixed with a full set of K-ras 

30 LDR probes (combined total of 9 pmol of discriminating and common probes) under 
standard conditions. The arrays were analyzed on a Molecular Dynamics 
FluorImager 595 and an Olympus AX70 epifluorescence microscope equipped with a 



Princeton Instruments TE/CCD-512 TKBM1 camera. A signal-to-noise ratio of 
greater than 3:1 was observed even when starting with a minimum of 3 fmol 
FamCZip13 labeled-probe within 4,500 fmol Fam label and 4,500 fmol addressable 
array-specific portion in the hybridization solution (see Figure 54). Using the 
5 microscope/CCD instrumentation, a 3:1 signal-to-noise ratio was observed even when 
starting with 1 fmol labeled product (see Figure 54). Thus, either instrument can 
readily quantify LDR product formed by either K-ras allele at the extremes of allele 
imbalance (from 6-80 fmol, see Table 11). 

For both instruments, a linear relationship is observed between labeled 

10 FamCZip13 added and fluorescent counts captured. Each array was plotted 

individually, and variation in fluorescent signal between arrays may reflect variation 
in amount of oligonucleotide coupled due to manual spotting and/or variation in 
polymer uniformity. Rehybridization of the same probe concentration to the same 
array is reproducible to +/- 5%, with capture efficiency from 20 to 50%. Since the 

15 total of both labeled and unlabeled addressable array-specific portions which 

complement a given address remains unchanged (at 500 fmol) from LDR reaction to 
LDR reaction, this result demonstrates the ability to quantify the relative amount of 
LDR product using addressable array detection. Since the relationship between 
starting template and LDR product retains linearity over 2 orders of magnitude with a 

20 similar limit of sensitivity at about 100 amoles (see Figure 51), combining PCR/LDR 
allele discrimination with array-based detection will provide quantifiable results. 

As shown in Figure 50, in embodiment A, the LDR oligonucleotide 
probe sets have a probe with the discriminating base labeled at its opposite end (i.e. 
fluorescent groups F1 and F2), while the other probe has the addressable array- 

25 specific portion (i.e. Z1). Alternatively, in embodiment B, the LDR oligonucleotide 
probe sets have a probe with the discriminating base and the addressable array- 
specific portion at its opposite end (i.e. Z1 and Z2), while the other probe has the label 
(i.e. fluorescent label F). When contacted with the support, the ligation products of 
embodiment A are captured at different sites but the same array address and ligation 

30 products are distinguished by the different labels F1 and F2. When the support is 

contacted with the ligation products of embodiment B, the different ligation products 
all have the same label but are distinguished by being captured at different addresses 
on the support. In embodiment A, the ratio of the different labels identifies an allele 
imbalance. Likewise, such an imbalance in embodiment B is indicated by the 
fluorescence ratio of label F at the addresses on the support. 

In carrying out this procedure, one may start with 100 ng of human 
5 DNA = 15,000 copies = 0.025 attomoles of each allele. The DNA is cut with DrdI, 
TaqI, and MspI, in the presence of phosphorylated DrdI adapters containing a unique 
two base 3' overhang (i.e. AA) and unphosphorylated TaqI and MspI adapters 
containing two base 5' overhangs, and in the presence of T4 ligase, such that the linkers are 
added to their respective overhangs in a homogeneous reaction at 37°C. Enzymes are 

10 inactivated by heating at 85°C to 98°C, preferably 95°C, for 2 to 20 minutes, 

preferably for 5 minutes. PCR amplification using a primer complementary to the 
DrdI adapter with an additional 3' base, i.e. (3' AAC) and a primer complementary to 
the other adapter will give a representation of 0.19% of the total genomic DNA. 

A PCR amplification of 30 to 35 cycles will give a good representation 

15 and produce about 10-20 µg of final mixed fragments. Some variation of 

thermocycling conditions may be required to obtain a broad representation of the 
majority of fragments at high yield. The PCR amplification will contain an average of 
1.5 x 10^9 copies for each allele of the approximately 8,750 fragments in the 
representation. This is equivalent to an average yield of 2.5 fmoles of each product. 

20 The larger fragments will yield less PCR product (about 1 fmole each), while the 
smaller fragments will yield a greater amount of product (from 5-10 fmole each). 
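
The copy-number arithmetic in the two preceding paragraphs can be checked with a short sketch. It is illustrative only: the diploid genome mass of about 6.6 pg is an assumed value that does not appear in the text, and the function names are hypothetical.

```python
AVOGADRO = 6.022e23  # molecules per mole


def copies_to_moles(copies):
    """Convert a molecule count to moles."""
    return copies / AVOGADRO


def genome_equivalents(mass_ng, diploid_genome_pg=6.6):
    """Diploid genome equivalents in a DNA mass.

    The 6.6 pg per diploid human genome figure is an assumption,
    not a value taken from the text.
    """
    return mass_ng * 1000.0 / diploid_genome_pg


# 100 ng of human DNA: roughly 15,000 diploid genomes, hence roughly
# 15,000 copies of each allele at a heterozygous site.
cells = genome_equivalents(100)
allele_amol = copies_to_moles(15_000) * 1e18   # attomoles of each allele

# After representational PCR: 1.5 x 10^9 copies of each allele, in femtomoles.
product_fmol = copies_to_moles(1.5e9) * 1e15

print(f"{cells:.0f} genome equivalents, {allele_amol:.3f} amol per allele")
print(f"{product_fmol:.1f} fmol of each product")
```

Under these assumptions the starting amount rounds to 0.025 attomoles and the amplified amount to 2.5 fmoles, in agreement with the figures quoted above.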

The same approach may be used for amplifying SNP containing 
fragments using either a different base on the 3' end, or alternatively, a different DrdI 
overhang. A total of 24 representation PCR reactions generate the amplicon sets for 

25 testing all 98,000 SNPs. Further, fragments amplified in the smaller representation 
may also be cloned and sequenced to find SNPs. 

The above procedure can be modified such that the representation will 
contain more or less fragments, and/or improve the yield of all fragments. For 
example, a size-selection between 200 and 2,000 bp prior to PCR amplification may 

30 improve the yield of fragments in the representation. For making larger 

representations, more than one linker for the DrdI site overhang may be used, e.g., 
both AA and AC overhangs, and PCR primers complementary to the DrdI adapter 
with an additional 3' base (i.e. 3' AAC and 3' ACC) would double the representation 
to approximately 17,500 fragments. Alternatively, more than one PCR primer 
complementary to the DrdI adapter with an additional 3' base (i.e. 3' AAC and 3' 
AAT) would also double the representation to approximately 17,500 fragments. 
5 Larger representations may be used if PCR amplification generates sufficient product 
for detection on the above described universal array, and/or as detection sensitivity 
improves. For making smaller representations, one or two PCR primers with two 
additional selective bases on the 3' end are used during the PCR amplification step, i.e. 
(3' AAAC + 3' AAAG) would reduce the representation to approximately 4,400 

10 fragments, while use of just one primer (3' AAAC) would reduce the representation to 
approximately 2,200 fragments. The ideal size of the representation will depend on 
the number of SNPs which will be detected (See Table 10). Other restriction 
endonucleases with degenerate overhangs as the primary enzyme may be used to 
create the representation, such as BglI, DraIII, AlwNI, PflMI, AccI, BsiHKAI, SanDI, 

15 SexAI, PpuI, AvaII, EcoO109, Bsu36I, BsrDI, BsgI, BpmI, SapI, or an isoschizomer 
of one of the aforementioned enzymes. Palindromic restriction endonucleases may 
also be used to create the representation, such as BamHI, AvrII, NheI, SpeI, XbaI, 
KpnI, SphI, AatII, AgeI, XmaI, NgoMI, BspEI, MluI, SacII, BsiWI, PstI, ApaLI, or an 
isoschizomer of one of the aforementioned enzymes. 
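
The way the representation size scales with linkers, primers, and selective bases can be sketched as follows. This is a simplified model, not a method from the text: it assumes every linker and primer contributes an equal share of fragments and that each additional selective base cuts a primer's share by a factor of four. The function and parameter names are hypothetical, and the text quotes rounded values (4,400 and 2,200).

```python
BASE_FRAGMENTS = 8750  # one DrdI linker (AA) and one primer with one selective base (3' AAC)


def representation_size(linkers=1, primers_per_linker=1, extra_selective_bases=0):
    """Approximate number of fragments in a representation.

    Assumes equal contributions from each linker and primer, and a
    four-fold reduction per extra selective base (uniform base usage).
    """
    share = BASE_FRAGMENTS / (4 ** extra_selective_bases)
    return share * linkers * primers_per_linker


print(representation_size(linkers=2))                                   # AA and AC linkers
print(representation_size(primers_per_linker=2))                        # 3' AAC and 3' AAT
print(representation_size(primers_per_linker=2, extra_selective_bases=1))  # 3' AAAC + 3' AAAG
print(representation_size(extra_selective_bases=1))                     # 3' AAAC only
```

Real fragment counts depend on base composition and fragment length, so these numbers should be read as estimates of the same order as those given above.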

20 

Table 10: High-throughput detection of SNPs on a DNA array 
17,500     8,200     0.38%     0.5-5 



30 

Large scale detection of SNPs using DrdI island representations and DNA array 

New technologies to identify and detect SNPs specifically provide 
35 tools to further understanding of the development and progression of colon cancer. 
One can determine chromosome abnormalities by quantifying allelic imbalance on 
universal DNA arrays using specific SNPs at multiple loci. This approach has the 
potential to rapidly identify multiple gene deletions and amplifications associated with 
tumor progression, as well as lead to the discovery of new oncogenes and tumor 
5 suppressor genes. 

Competitive and real time PCR approaches require careful 
optimization to detect 2-fold differences. Unfortunately, stromal contamination may 
reduce the ratio between tumor and normal chromosome copy number to less than 2- 
fold. Consider two samples: one with 4-fold amplification of the tumor gene (thick 

10 black line) and 50% stromal contamination, the other with loss of heterozygosity 
(LOH, one chromosome containing the gene is missing, thin black line) and 40% 
stromal contamination (See Figure 55). Using either microsatellite or SNP analysis, 
both samples would show an allele imbalance of 2.5:1 for the tumor gene (black), 
and allele balance for the control gene (gray, Figure 55, first line). Comparing the 

15 ratio of the tumor gene in the tumor sample to the control gene over the ratio of the 
tumor gene in the normal sample (normalized to the same number of cells) to the 
control gene, the stromal contamination reduces the ratio from the amplified sample 
to 1.75 and increases the ratio from the LOH sample to 0.7 (Figure 55, second line). 
These ratios are exceedingly difficult to distinguish from 1.0 by competitive PCR. 

20 However, by using SNP analysis to compare the ratio of tumor to control allele (i.e. 
thick line) over the ratio of normal to control allele, then it may be possible to 
distinguish gene amplification (thick black line) with a ratio of 2.5 from LOH (thin 
black line) with a ratio of 0.4 (Figure 55, bottom line). It is important that relative 
allele signal can be accurately quantified. 
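
The ratios quoted for the two hypothetical samples can be reproduced with a small calculation. The sketch below assumes that a normal (stromal) cell carries one copy of each allele and that the amplified sample carries four copies of one allele and one of the other; the function names are illustrative.

```python
def mixed_copies(tumor_copies, tumor_fraction, normal_copies=1.0):
    """Average copies per cell of one allele in a tumor/stroma mixture."""
    return tumor_fraction * tumor_copies + (1.0 - tumor_fraction) * normal_copies


def summarize(thick_tumor, thin_tumor, tumor_fraction):
    """Ratios for a sample given per-tumor-cell copies of the two alleles."""
    thick = mixed_copies(thick_tumor, tumor_fraction)
    thin = mixed_copies(thin_tumor, tumor_fraction)
    return {
        # allele imbalance seen by microsatellite or SNP analysis
        "allele_imbalance": max(thick, thin) / min(thick, thin),
        # total gene copies relative to a normal cell (two copies)
        "gene_dosage": (thick + thin) / 2.0,
        # each allele relative to its single copy in a normal cell
        "thick_vs_normal": thick / 1.0,
        "thin_vs_normal": thin / 1.0,
    }


# 4-fold amplification of one allele, 50% stromal contamination
amplified = summarize(thick_tumor=4, thin_tumor=1, tumor_fraction=0.5)
# LOH (one allele lost), 40% stromal contamination
loh = summarize(thick_tumor=1, thin_tumor=0, tumor_fraction=0.6)

for name, result in (("amplified", amplified), ("LOH", loh)):
    print(name, {key: round(value, 2) for key, value in result.items()})
```

Both samples give the same 2.5:1 allele imbalance, the gene dosage ratios are 1.75 and 0.7, and the per-allele comparison separates them as 2.5 (amplified allele) versus 0.4 (lost allele).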

25 To determine if PCR/LDR allows accurate quantification of mutant 

and wild-type K-ras alleles, PCR-amplified fragments derived from pure cell lines 
were mixed in varying ratios and analyzed in a competitive three LDR probe system 
in which upstream discriminating probes specific for either the wild-type or the G12V 
mutant allele competed for a downstream probe common to both alleles (Figure 56). 

30 Optimal quantification was achieved by using LDR probes in slight excess of K-ras 
template and limiting LDR cycles so products were in the linear range for fluorescent 
quantification on an ABI 373 sequencer. Under these conditions, mutant/wt ratios 
from 1:6 to 6:1 could be accurately quantified, and when normalized to the 1:1 
products were within 10% of the predicted value (Table in Figure 56). Similar results 
were obtained using probe sets for G12D, G12C, and G13D. Quantitative LDR was 
performed on PCR-amplified DNA isolated from 10 colorectal carcinoma cell lines. 
5 Four cell lines contained either pure mutant or wild-type ("wt") alleles, three 

contained approximately equal amounts of mutant and wt alleles (0.7 - 1.1), and three 
contained an increased ratio of mutant:wt alleles (1.8-4.0). Allelic imbalance was 
highly correlated to the proportion of cellular p21 ras protein present in the activated, 
GTP-bound form. These data support the conclusion that allelic imbalance with 
10 amplification of the mutant K-ras gene is a second genetic mechanism of K-ras 
activation. 

Genomic DNA was extracted from 44 archival primary colon cancers 
known to contain a point mutation in the K-ras gene, amplified using PCR primers 
specific for exon 1 of K-ras, and quantified with competitive LDR. The percentage of 

15 stromal cell contamination in primary colon cancers was estimated by an independent 
pathologist for each sample and this value was used to correct the mutant:wt ratio 
(Table 11). K-ras allelic imbalance was calculated to be 2-fold or greater whenever 
the corrected mutant/wt ratio measured by LDR exceeded 2 (Table 11). To evaluate 
the impact of K-ras allelic imbalance in this group of patients, disease-specific 

20 survival curves were obtained by the Kaplan-Meier method using the log-rank test. 
While tumors with wild-type or non-amplified K-ras mutations (mutant:wt ratio < 2) 
showed similar survival trends, tumors with amplification of K-ras (ratio > 2) had a 
significantly worse survival compared to mutant tumors without allelic imbalance (p = 
0.03) and to wt tumors (p = 0.0001). Thus, gene amplification is an important second 

25 mechanism of K-ras activation and negatively impacts on disease-specific survival in 
colon cancer. 
Table 11. Corrected ratios of mutant K-ras to wild-type alleles in primary colon 
cancers. 

Left: Representative samples with K-ras mutation and low-level allele imbalance (< 2) 
Right: Representative samples with K-ras mutation and high-level allele imbalance (> 2) 

Colon cancer tumors with known K-ras genotype were analyzed to determine the degree of allelic 
imbalance using a modified PCR/LDR technique. The mutant/wt ratio was determined experimentally 
and corrected based on the estimated percentage of stromal contamination in the microdissected tumor 
specimen, using the formula: X = mutant/wt (Observed) x (%T + 2(1-%T)) / %T, where X = 

10 Corrected mutant/wt ratio of chromosomes, and %T = Percentage of tumor cells in section. Allelic 
imbalance was considered significant when the ratio was more than 2.0 (e.g., at least two copies of the 
mutant allele compared to one copy of the wt allele in the tumor). For low mutant:wt ratios, allele 
imbalance may also be due to loss of the normal K-ras allele in the tumor cell, e.g., an observed 
mutant:wt ratio of 0.5 with 50% of the cells from the tumor (samples #3 & #10) may reflect one mutant 

15 allele in the tumor cell to two wild-type alleles in the normal cell. Under these calculations X = 
mutant/wt (Observed) x 2(1-%T) / %T = 0.5 x 2(1-0.5)/0.5 = 1 mutant K-ras allele in the tumor cell, 
with LOH of the other allele. The left side of the table shows representative samples in which allelic 
imbalance was minimal while the right side of the table shows representative samples in which the K- 
ras mutant allele is amplified. The table demonstrates that the corrected mutant:wt ratio is dependent 

20 on both the observed ratio and the percentage of stromal contamination in the sample. 
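
The two correction formulas in the note above differ only in the number of wild-type alleles assumed per tumor cell (one in the general case, none under LOH). A minimal sketch that covers both cases follows; the `tumor_wt_copies` parameter is a hypothetical convenience introduced here, not part of the original formula.

```python
def corrected_ratio(observed, tumor_fraction, tumor_wt_copies=1):
    """Corrected mutant/wt ratio X for a tumor section.

    observed         measured mutant/wt ratio
    tumor_fraction   %T expressed as a fraction, 0 < T <= 1
    tumor_wt_copies  wt alleles assumed per tumor cell: 1 (general case)
                     or 0 (LOH of the normal allele)
    """
    if not 0 < tumor_fraction <= 1:
        raise ValueError("tumor_fraction must be in (0, 1]")
    stromal_wt = 2 * (1 - tumor_fraction)   # two wt alleles per normal cell
    wt_total = tumor_wt_copies * tumor_fraction + stromal_wt
    return observed * wt_total / tumor_fraction


# General formula: X = observed x (%T + 2(1-%T)) / %T
print(corrected_ratio(0.5, 0.5))
# LOH case from the note: X = 0.5 x 2(1-0.5)/0.5
print(corrected_ratio(0.5, 0.5, tumor_wt_copies=0))
```

With an observed ratio of 0.5 and 50% tumor cells, the LOH form returns 1, matching the worked value in the note, and the general form returns 1.5.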



The above data demonstrates that PCR/LDR may be used to accurately 
quantify mutant and wild-type K-ras alleles using an automated DNA sequencer to 
25 detect the fluorescent signal. Further, the work above demonstrated that femtomole 
amounts of CZip fluorescently-labeled product in picomole quantities of total probe 
and label can be captured at its cognate address and quantified using either 
FluorImager or CCD detection. 

The use of fluorescently-labeled oligonucleotides on DNA arrays has 
30 the advantages of multiple labels, long lifetimes, ease of use, and disposal over 

traditional radiolabels. However, the efficiency of fluorescent emissions from a given 
fluorophore is dependent on multiple variables (i.e. solvation, pH, quenching, and 
packing within the support matrix) which makes it difficult to produce accurate 
calibration curves. This problem may be effectively circumvented by using two 
fluorescent labels and determining their ratio for each address (Hacia, et al., 
5 "Detection of Heterozygous Mutations in BRCA1 Using High Density 

Oligonucleotide Arrays and Two-Colour Fluorescence Analysis," Nature Genetics, 
14(4):441-7 (1996); DeRisi, et al., "Use of a cDNA Microarray to Analyse Gene 
Expression Patterns in Human Cancer," Nature Genetics, 14(4):457-60 (1996); 
Schena, et al., "Parallel Human Genome Analysis: Microarray-Based Expression 

10 Monitoring of 1000 Genes," Proc. Nat'l. Acad. Sci. USA, 93(20):10614-9 (1996); 
Shalon, et al., "A DNA Microarray System for Analyzing Complex DNA Samples 
Using Two-Color Fluorescent Probe Hybridization," Genome Research, 6(7):639-45 
(1996); and Heller, et al., "Discovery and Analysis of Inflammatory Disease-Related 
Genes Using cDNA Microarrays," Proc. Nat'l. Acad. Sci. USA, 94(6):2150-5 (1997), 

15 which are hereby incorporated by reference). 

Below two sets of alternative dual labeling strategies are addressed. In 
the first set, shown in Figure 57, signal is quantified by using a fluorescent label on 
the array surface at the address. In the second and preferred set, shown in Figure 62, 
signal is quantified by using a small percentage of fluorescent label on the probe 

20 which contains the capture oligonucleotide complement. 

The first set of dual label strategies to quantify LDR signal using 
addressable DNA arrays is shown in Figures 57A-B. In Figure 57A, the common 
LDR probe for both alleles contains a fluorescent label (F1) and the discriminating 
probe for each allele contains a unique address-specific portion. Following 

25 hybridization of the LDR reaction mixture to an array composed of fluorescently- 
labeled (F2) ligation product, the ratio of F1/F2 for each address can be used to 
determine relative percent mutation or allelic imbalance. In Figure 57B, the common 
probe for both alleles contains an address-specific portion and the discriminating 
probe for each allele contains a unique fluorescent label, F1 or F2. Following LDR, 

30 the reaction mixture is hybridized to an array and the ratios of F1/F2 for each address 
can again be used to determine relative percent mutation or allelic imbalance. In 
addition, by adding a third label, F3, to the oligonucleotide coupled to the surface it 
will be possible to quantify each allele separately. One method of determining allele 
imbalance compares (F1 captured signal/F2 address signal) where the matched tumor and 
normal samples are hybridized to two different arrays (where variability in addresses 
is less than 10%, achieved by printing two arrays on the same slide). The allele 
5 imbalance is calculated for each sample by the formula 
{(F1_(Allele 1: tumor)/F2_(Address 1)) / (F1_(Allele 2: tumor)/F2_(Address 2))} / 
{(F1_(Allele 1: normal)/F2_(Address 1)) / (F1_(Allele 2: normal)/F2_(Address 2))}. 
Even if considerable variance between addresses remains, the 
overall calculation for the ratio of allele imbalance will remain accurate, provided the 
identical reusable array is used for both tumor and normal samples, in which case the 
10 above equation simplifies to 
(F1_(Allele 1: tumor)/F1_(Allele 1: normal)) / (F1_(Allele 2: tumor)/F1_(Allele 2: normal)). 
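
The cancellation of the address terms when the same array is reused can be illustrated numerically. The sketch below uses made-up signal values purely for demonstration; the function names are hypothetical and each argument is a pair of values for (allele 1, allele 2) or (address 1, address 2).

```python
import math


def allele_imbalance(f1_tumor, f1_normal, f2_address):
    """Full formula, with tumor and normal read against the same F2 address signals."""
    tumor = (f1_tumor[0] / f2_address[0]) / (f1_tumor[1] / f2_address[1])
    normal = (f1_normal[0] / f2_address[0]) / (f1_normal[1] / f2_address[1])
    return tumor / normal


def simplified_imbalance(f1_tumor, f1_normal):
    """Simplified formula for an identical reusable array."""
    return (f1_tumor[0] / f1_normal[0]) / (f1_tumor[1] / f1_normal[1])


# Hypothetical F1 signals (arbitrary units) for alleles 1 and 2
tumor = (900.0, 300.0)
normal = (400.0, 400.0)

even = allele_imbalance(tumor, normal, (100.0, 100.0))    # uniform addresses
uneven = allele_imbalance(tumor, normal, (100.0, 35.0))   # poorly matched addresses

print(even, uneven, simplified_imbalance(tumor, normal))
```

Because the F2 address terms appear in both numerator and denominator, the result does not change with address-to-address variation as long as the same array serves both samples.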

The advantages of using the present invention compared to other 
detection schemes are as follows: this approach to polymorphism detection has three 
orthogonal components: (i) primary representational PCR amplification; (ii) solution- 

15 phase LDR detection; and (iii) solid-phase hybridization capture. Therefore, 

background signal from each step can be minimized, and consequently, the overall 
sensitivity and accuracy of the method of the present invention are significantly 
enhanced over those provided by other strategies. For example, "sequencing by 
hybridization" methods require: (i) multiple rounds of PCR or PCR/T7 transcription; 

20 (ii) processing of PCR amplified products to fragment them or render them single- 
stranded; and (iii) lengthy hybridization periods (10 h or more) which limit their 
throughput. Additionally, since the immobilized probes on these arrays have a wide 
range of Tms, it is necessary to perform the hybridizations at temperatures from 0 °C 
to 44 °C. The result is increased background noise and false signals due to mismatch 

25 hybridization and non-specific binding, for example, on small insertions and deletions 
in repeat sequences. In contrast, the present approach allows multiplexed PCR in a 
single reaction, does not require an additional step to convert product into single- 
stranded form, and can readily distinguish all point mutations including 
polymorphisms in mononucleotide and short dinucleotide repeat sequences. This last 

30 property expands the number of polymorphisms which may be considered for SNP 
analysis to include short length polymorphisms, which tend to have higher 



heterozygosities. Alternative DNA arrays suffer from differential hybridization 
efficiencies due to either sequence variation or to the amount of target present in the 
sample. By using divergent sequences for the addressable array-specific portion (i.e. 
zip-code) with similar thermodynamic properties, hybridizations can be carried out at 
5 resulting in a more stringent and rapid hybridization. The decoupling of the 

hybridization step from the mutation detection stage offers the prospect of 
quantification of LDR products, as we have already achieved using gel-based LDR 
detection. 

Arrays spotted on polymer surfaces provide substantial improvements 

10 in signal capture compared with arrays spotted directly on glass surfaces. The 

polymers described above are limited to the immobilization of 8- to 10-mer addresses; 
however, the architecture of the presently described polymeric surface readily allows 
24-mer addresses to penetrate and couple covalently. Moreover, LDR products of 
length 60 to 75 nucleotide bases are also found to penetrate and subsequently 

15 hybridize to the correct address. As additional advantages, the polymer gives little or 
no background fluorescence and does not exhibit non-specific binding of 
fluorescently-labeled oligonucleotides. Finally, addresses spotted and covalently 
coupled at a discrete address do not "bleed over" to neighboring spots, hence 
obviating the need to physically segregate sites, e.g., by cutting gel pads. 

20 Nevertheless, alternative schemes for detecting SNPs using a primary 

representational PCR amplification have been considered and are briefly included 
herein. Since the representations are the consequence of amplification of fragments 
containing two different adapters, the procedure may be easily modified to render 
single stranded product which is preferred for "sequencing by hybridization" and 

25 single nucleotide polymerase extension ("SNUPE") detection. Thus, one linker 
adapter may contain a T7 or other RNA polymerase binding site to generate single- 
stranded fluorescently labeled RNA copies for direct hybridization. Or, one strand 
may be biotinylated and removed with streptavidin coated magnetic beads. Another 
alternative option is to put a 5' fluorescent group on one probe, and a phosphate group 

30 on the 5' end of the other probe and treat the mixture with Lambda Exonuclease. This 
enzyme will destroy the strand containing the 5' phosphate, while leaving the 
fluorescently labeled strand intact. 
For detection using single nucleotide polymerase extension 
("SNUPE"), a probe containing an addressable array-specific portion on the 5' end, 
and a target-specific portion on the 3' end just prior to the selective base is hybridized 
to the target. Fluorescently labeled dye-dideoxynucleotides are added with a high 
5 fidelity polymerase which inserts the labeled base only if the complementary base is 
present on the target (Figure 58). The ratios of F1/F2 for each address can be used to 
determine relative percent mutation or allelic imbalance. 

Alternatively, LDR products may be distinguished by hybridizing to 
gene specific arrays (Figure 59A-B). This may be achieved by hybridizing to the 

10 common probe (Figure 59A) or across the ligation junction (Figure 59B). A 

"universal" nucleotide analog may be incorporated into the address so that neither 
allele product hybridizes better to the array. Again, the ratios of F1/F2 for each 
address can be used to determine relative percent mutation or allelic imbalance. 
For large representations, or direct detection of any SNPs in the 

15 absence of a representation, LDR/PCR may be used (Figure 60). In this scheme, the 
discriminating probes contain universal probes with unique addressable portions on 
the 5' side, while the common probes have universal primers on the 3' side. The 
upstream probe has the addressable array-specific portion in-between the target- 
specific portion and the universal probe portion, i.e. the probe will need to be about 70 

20 bp long. After an LDR reaction, the LDR products are PCR amplified using the 

universal PCR primer pair, with one primer fluorescently labeled. To avoid ligation 
independent PCR amplification, it may be necessary to incorporate a series of 
blocking groups on the 3' end of the downstream common probe (excellent successes 
have been achieved by applicants with thiophosphate linkages of the last four O- 

25 methyl riboU bases), and treat the ligation products with Exo III. See WO 97/45559, 
which is hereby incorporated by reference. 

The addressable array-specific portion is now in the middle of a 
double-stranded product. For maximum capture efficiency, it may be desirable to 
render the product single-stranded, either with T7 RNA polymerase or with 

30 biotinylated probe. One alternative option is to put a 5' fluorescent group on one 
probe, and a phosphate group on the 5' end of the other probe and treat the mix with 
Lambda Exonuclease (See Figure 61). This enzyme will destroy the strand containing 
the 5' phosphate, while leaving the fluorescently labeled strand intact. 

The final products are then captured on the addressable array at the 
specific addresses. The ratio of signal at Z1/Z2 can be used to determine relative 
5 percent mutation or allelic imbalance. It may be difficult to quantify subtle 

differences of allele imbalance since the different addressable array-specific portions 
may alter the ratio of alleles in the final PCR product. Nevertheless, LDR/PCR may 
aid in quantification of LOH and gene amplifications at multiple loci simultaneously. 

Figure 62 presents the second set of dual label strategies to quantify 

10 LDR signal using addressable DNA arrays. In Figure 62A, the common LDR probe 
for both alleles contains a fluorescent label (F1) and the discriminating probe for each 
allele contains a unique addressable sequence. A small percentage of each 
discriminating probe contains a fluorescent label F2. Following hybridization of the 
LDR reaction mixture to an array, the ratio of F1/F2 for each address can be used to 

15 determine relative percent mutation or allelic imbalance. By placing the second 
fluorescent label on both discriminating probes, one controls for differences in either 
address spotting or hybridization kinetics of each individual address. For example, 
consider that 10% of the discriminating probes contain F2. Consider a sample 
containing 3-fold more of the C allele than the T allele. After an LDR reaction, 20% 

20 of the common probe has been ligated to form the T-specific product containing 
address-specific portion Z1, and 60% has formed the C-specific product containing 
address-specific portion Z2. Due to differences in spotting, the array captures 50% of 
the Z1 signal, but only 30% of the Z2 signal. F1/F2 for Z1 = (50% of 20%)/(50% of 
10%) = 10%/5% = 2. F1/F2 for Z2 = (30% of 60%)/(30% of 10%) = 18%/3% = 6. By 

25 taking the ratio of F1/F2 for Z2 to F1/F2 for Z1, 6/2 = 3 is obtained which accurately 
reflects the allele imbalance in the sample. 
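
The worked example above can be expressed as a short calculation showing why the capture efficiency drops out. The function name and argument names are illustrative; the percentages are those given in the text, written as fractions.

```python
import math


def f1_f2_ratio(ligated_fraction, labeled_fraction, capture_efficiency):
    """F1/F2 at one address.

    ligated_fraction    fraction of common probe (F1) ligated into this product
    labeled_fraction    fraction of discriminating probe carrying F2
    capture_efficiency  fraction of this address's signal captured by the array
    """
    f1 = capture_efficiency * ligated_fraction
    f2 = capture_efficiency * labeled_fraction
    return f1 / f2


z1 = f1_f2_ratio(0.20, 0.10, 0.50)   # T-specific product at address Z1
z2 = f1_f2_ratio(0.60, 0.10, 0.30)   # C-specific product at address Z2

print(round(z1, 6), round(z2, 6), round(z2 / z1, 6))
```

The capture efficiency multiplies both F1 and F2 at a given address, so the ratio of the two per-address values recovers the 3-fold excess of the C allele regardless of spotting differences.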

In Figure 62B, the common probe for both alleles contains an 
addressable sequence and the discriminating probe for each allele contains a unique 
fluorescent label, Fl or F2. Following LDR, the reaction mixture is hybridized to an 

30 array and the ratios of F1/F2 for each address can again be used to determine relative 
percent mutation or allelic imbalance. In addition, by adding a small percentage of 
common probe containing label F3, it is possible to quantify each allele separately. 
Dual label hybridization to the same address using dye combinations of 
fluorescein/phycoerythrin, fluorescein/Cy5, Cy3/rhodamine, and Cy3/Cy5 has been 
used successfully (Hacia, et al., "Detection of Heterozygous Mutations in BRCA1 
Using High Density Oligonucleotide Arrays and Two-Colour Fluorescence Analysis," 
5 Nature Genetics, 14(4):441-7 (1996); DeRisi, et al., "Use of a cDNA Microarray to 
Analyse Gene Expression Patterns in Human Cancer," Nature Genetics, 14(4):457-60 
(1996); Schena, et al., "Parallel Human Genome Analysis: Microarray-Based 
Expression Monitoring of 1000 Genes," Proc. Nat'l. Acad. Sci. USA, 93(20):10614-9 
(1996); Shalon, et al., "A DNA Microarray System for Analyzing Complex DNA 

10 Samples Using Two-Color Fluorescent Probe Hybridization," Genome Research, 
6(7):639-45 (1996); and Heller, et al., "Discovery and Analysis of Inflammatory 
Disease-Related Genes Using cDNA Microarrays," Proc. Nat'l. Acad. Sci. USA, 
94(6):2150-5 (1997), which are hereby incorporated by reference). A list of potential 
dyes which may be used in the labeling schemes described above is provided in 

15 Table 12. For the above schemes to be successful, the dye sets used should not 
interfere with each other. 
Table 12: List of Dyes which may be used for fluorescent detection of SNPs.

Dye                 Abs. Max (nm)    Em. Max (nm)

Marina Blue         365              460
Fluorescein         495              520
TET                 521              536
TAMRA               565              580
Rhodamine           575              590
ROX                 585              610
Texas Red           600              615
Cy2                 489              506
Cy3                 550              570
Cy3.5               581              596
Cy5                 649              670
Cy5.5               675              694
Cy7                 743              767
Spectrum Aqua       433              480
Spectrum Green      509              538
Spectrum Orange     559              588
BODIPY FL           505              515
BODIPY R6G          530              550
BODIPY TMR          545              575
BODIPY 564/570      565              575
BODIPY 581/591      580              600
BODIPY TR           595              625
BODIPY 630/650      640              650


A representational PCR amplification will contain an average of 1.5 x
10^9 copies of each allele of approximately 8,750 fragments in the representation. This
is equivalent to an average yield of 2.5 fmoles of each product. The larger fragments
will yield less PCR product (about 1 fmole each), while the smaller fragments will
yield a greater amount of product (from 5-10 fmole each). Of these 8,750 fragments,
about 4,100 will contain SNPs. As demonstrated above, the representational
PCR/LDR/universal array capture scheme should have the requisite sensitivity to
detect gene amplification or loss of heterozygosity at the vast majority of these SNPs
simultaneously.
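
The equivalence between copy number and femtomoles stated above follows from Avogadro's number. A minimal sketch of the conversion is given below; the function name fmol_to_copies is hypothetical, and only the yields quoted in the text (1, 2.5, 5 and 10 fmole) are used as inputs.

```python
AVOGADRO = 6.02214076e23  # molecules per mole


def fmol_to_copies(femtomoles):
    """Convert an amount in femtomoles to a number of molecules (copies)."""
    return femtomoles * 1e-15 * AVOGADRO


# Yields quoted in the text: average, large fragments, small fragments.
for amount in (2.5, 1.0, 5.0, 10.0):
    print(f"{amount} fmol = {fmol_to_copies(amount):.2e} copies")
```

An average yield of 2.5 fmoles corresponds to roughly 1.5 x 10^9 molecules, which agrees with the average copy number given for each allele.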

This scheme has immediate utility for detecting allele imbalance in
tumors. An initial array of 4,096 addresses may be used to find general regions of
gene amplifications or LOH. Subsequently, arrays may be used to pinpoint the
regions using more closely-spaced SNPs.



A major advantage of the representational PCR amplification is the
ability to amplify approximately 8,750 fragments proportionally to their original copy
number in the original sample. While some fragments may amplify more than others,
repeated amplification of normal samples will reveal fragments whose PCR and LDR
products are consistently amplified to similar yields. Thus, for a given fragment
which is either amplified or lost in the tumor (designated "g"), there will be at least
one fragment which retains normal yields (designated "c"). For each allele pair (g1,
g2) which is imbalanced, there is a control locus (c1, c2) which exhibits
heterozygosity in both the normal and tumor sample. To determine if a given allele
has been amplified or deleted, the ratio of ratios between matched tumor and normal
samples is calculated, e.g., r = (g1_tumor/c1_tumor) / (g1_normal/c1_normal). If r > 2, then g1
is amplified; if r < 0.5, then g1 is deleted. The identical calculation is also applied to
the matched alleles, g2 and c2, which should yield a value of approximately 1.0,
except for cases such as K-ras, where one allele may be lost while the other (mutated)
allele is amplified. These calculations may be performed with additional informative
SNPs in a given region matched with different control regions. Certain SNP/control
pairs will amplify at similar rates and, hence, more accurately reflect relative gene
copy number.
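
The ratio-of-ratios test described above can be sketched as follows. The function names and the signal values in the usage lines are hypothetical and chosen only for illustration; the thresholds of 2 and 0.5 are the ones stated in the text.

```python
def ratio_of_ratios(g_tumor, c_tumor, g_normal, c_normal):
    """r = (g_tumor / c_tumor) / (g_normal / c_normal) for one allele and its control."""
    return (g_tumor / c_tumor) / (g_normal / c_normal)


def classify(r, amplified_above=2.0, deleted_below=0.5):
    """Apply the thresholds given in the text to a ratio of ratios."""
    if r > amplified_above:
        return "amplified"
    if r < deleted_below:
        return "deleted"
    return "unchanged"


# Hypothetical array signals (arbitrary fluorescent units), not data from the source.
r_g1 = ratio_of_ratios(g_tumor=9000, c_tumor=3000, g_normal=3200, c_normal=3100)
r_g2 = ratio_of_ratios(g_tumor=2900, c_tumor=3000, g_normal=3000, c_normal=3100)
print(classify(r_g1), classify(r_g2))
```

Dividing by the control locus within each sample removes differences in overall yield between the tumor and normal reactions, so only a change specific to the g allele moves r away from 1.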

Examples of the different schemes for distinguishing gene
amplification from loss of heterozygosity are illustrated in Figures 63-66. These four
figures demonstrate how representational PCR/LDR with addressable array capture
may be used to distinguish amplification of genes at the DNA level (Figures 63-64)
or, alternatively, loss of one chromosomal region at that gene (LOH, Figures 65-66).
Detection of differences using the address complements on the discriminating probes
is illustrated in Figures 63 and 65, while placing the address complements on the
common probes is illustrated in Figures 64 and 66.

Figures 63-64 illustrate schematically (using pictures of 4 cells) a
cancer where the tumor cells (jagged edges) have 4 copies each of one tumor gene
allele (C), one copy each of the other tumor gene allele (T), and one copy each of the
normal gene alleles (G, A). The normal cells (ovals) have one copy each of the tumor
gene alleles (C, T), and one copy each of the normal gene alleles (G, A). By using
representational PCR/LDR with addressable array capture (as described above), one
can demonstrate that the one tumor gene allele (C) is present at a higher ratio (i.e. 2.5)
than the other tumor gene allele as well as the other normal alleles, even in the
presence of 50% stromal contamination. Thus, that allele is amplified.
In particular, after the sample of cells is treated to recover its
constituent DNA, which is PCR amplified, the amplified DNA is subjected to an LDR
procedure. In Figure 63, the discriminating base is on the oligonucleotide probe with
a different addressable array-specific portion for each different discriminating base,
while the other oligonucleotide probe is always the same and has the same label.
Figure 64 has the discriminating base on the oligonucleotide probe with the label with
different labels being used for each different discriminating base, while the other
oligonucleotide probe is always the same and has the same addressable array-specific
portion. In either case, whether distinguished by hybridization at different array
locations using the same label or by hybridization at any location with each ligation
product being distinguished and identified by its label, it is apparent that there is a
ratio of C to T alleles of 2.5 and a ratio of G to A alleles of 1.0.

Figures 65-66 illustrate schematically (using pictures of 5 cells) a
cancer where the tumor cells (jagged edges) have no copies each of one tumor gene
allele (T), one copy each of the other tumor gene allele (C), and one copy each of the
normal gene alleles (G, A). The normal cells (ovals) have one copy each of the tumor
gene alleles (C, T), and one copy each of the normal gene alleles (G, A). By using
representational PCR/LDR with addressable array capture (as described above), one
can demonstrate that the one tumor gene allele (T) is present at a lower ratio (i.e. 0.4)
than the other tumor gene allele as well as the other normal alleles, even in the
presence of 40% stromal contamination. Thus, that allele has been lost, i.e. the cell
has undergone loss of heterozygosity.
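
The two ratios quoted for Figures 63-66 can be reproduced by counting allele copies across the mixed cell population. In the sketch below, the split into 2 tumor plus 2 normal cells (50% stromal contamination of 4 cells) and 3 tumor plus 2 normal cells (40% of 5 cells) is inferred from the percentages in the text, and the function name pooled_copies is hypothetical.

```python
def pooled_copies(tumor_cells, normal_cells, tumor_genotype, normal_genotype):
    """Total copies of each allele in a mixture of tumor and normal cells."""
    alleles = set(tumor_genotype) | set(normal_genotype)
    return {
        allele: tumor_cells * tumor_genotype.get(allele, 0)
        + normal_cells * normal_genotype.get(allele, 0)
        for allele in alleles
    }


normal = {"C": 1, "T": 1}

# Amplification case (Figures 63-64): tumor cells carry 4 copies of C, 1 of T.
amplified = pooled_copies(2, 2, {"C": 4, "T": 1}, normal)
amp_ratio = amplified["C"] / amplified["T"]   # 10 copies of C over 4 copies of T

# LOH case (Figures 65-66): tumor cells carry 1 copy of C and none of T.
loh = pooled_copies(3, 2, {"C": 1, "T": 0}, normal)
loh_ratio = loh["T"] / loh["C"]               # 2 copies of T over 5 copies of C

print(amp_ratio, loh_ratio)
```

In both cases the contaminating normal cells pull the observed ratio toward 1, but the imbalance remains measurable.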

In particular, after the sample of cells is treated to recover its
constituent DNA, which is PCR amplified, the amplified DNA is subjected to an LDR
procedure. In Figure 65, the discriminating base is on the oligonucleotide probe with
a different addressable array-specific portion for each different discriminating base,
while the other oligonucleotide probe is always the same and has the same label.
Figure 66 has the discriminating base on the oligonucleotide probe with the label with
different labels being used for each different discriminating base, while the other
oligonucleotide probe is always the same and has the same addressable array-specific
portion. In either case, whether distinguished by hybridization at different array
locations using the same label or by hybridization at any location with each ligation
product being distinguished and identified by its label, it is apparent that there is a
ratio of C to T alleles of 2.5 and a ratio of G to A alleles of 1.0.

For each example, 10% of the probes containing an addressable array-specific
portion are labeled with a fluorescent group (F2 in Figures 63 and 65, F3 in
Figures 64 and 66). To illustrate that LDR ligation efficiencies are not always
identical among two alleles of a given gene, in each example, the ratio of C:T tumor
gene allele ligations in the normal cells will be set at 60%:40%, while the ratio of G:A
control gene allele ligations in the normal cells will be set at 45%:55%. To simplify
the calculations, the chromosomes observed in the illustration will be multiplied by
1,000 to obtain a representative value for the amount of ligation product formed in
arbitrary fluorescent units. In addition, the total number of probes containing an
addressable array-specific portion in a reaction will be arbitrarily set at 100,000, such
that 10% of 100,000 = 10,000 labeled addressable array-specific portion (although not
all addresses) will be equally captured. The calculations for the analyses of
Figures 63-66 are set forth in Figures 67-70, respectively.

Further, to illustrate that the technique is independent of either array
address spotting or hybridization kinetics, the percent of probes captured will be
randomly varied between 30% and 60%. This concept will work even in the absence
of a "control" fluorescent label on either the addressable array-specific portion
(described herein, Figure 62) or fluorescent label on the array addresses. This may be
achieved by printing two sets of identical arrays on the same polymer surface side-by-side,
where both polymer and amount spotted at each address is relatively consistent,
using the first array for the tumor sample, and the second array for the normal control.
Alternatively, the same array may be used twice, where results are quantified first
with the tumor sample, then the array is stripped, and re-hybridized with the normal
sample.
Large scale detection of SNPs using DrdI island representations and DNA array
capture: Use in association studies.

The above sections emphasized the use of SNPs to detect allelic
imbalance and potentially LOH and gene amplification associated with the
development of colorectal cancer. The PCR/LDR addressable array scheme may also
aid in finding low risk genes for common diseases using "identity by descent"
(Lander, E.S., "The New Genomics: Global Views of Biology," Science,
274(5287):536-9 (1996) and Risch, et al., "The Future of Genetic Studies of Complex
Human Diseases," Science, 273(5281):1516-7 (1996), which are hereby incorporated
by reference). In ethnic populations, chromosomal regions in common among
individuals with the same disease may be localized to approximately 2 Mb regions
using a combination of genome mismatch scanning and chromosomal segment
specific arrays (Cheung, et al., "Genomic Mismatch Scanning Identifies Human
Genomic DNA Shared Identical by Descent," Genomics, 47(1):1-6 (1998); Cheung,
et al., "Linkage-Disequilibrium Mapping Without Genotyping," Nat Genet.,
18(3):225-230 (1998); McAllister, et al., "Enrichment for Loci Identical-by-Descent
Between Pairs of Mouse or Human Genomes by Genomic Mismatch Scanning,"
Genomics, 47(1):7-11 (1998); and Nelson, et al., "Genomic Mismatch Scanning: A
New Approach to Genetic Linkage Mapping," Nat Genet., 4(1):11-8 (1993), which are
hereby incorporated by reference). SNPs near the disease gene (i.e. in linkage
disequilibrium) will demonstrate allele imbalance compared with the unaffected
population. If the SNP is directly responsible for increased risk, then the allele
imbalance will be much higher, e.g., the APC I1307K polymorphism is found in 6% of
the general Ashkenazi Jewish population, but in approximately 30% of
Ashkenazi Jews diagnosed with colon cancer who have a family history of colon
cancer (Laken, et al., "Familial Colorectal Cancer in Ashkenazim Due to a
Hypermutable Tract in APC," Nature Genetics, 17(1):79-83 (1997), which is hereby
incorporated by reference). If the actual T -> A transversion responsible for the
condition has been identified, then allele imbalance will be observed in a SNP
analysis by comparing allele frequency in up to 20 unaffected individuals
(94% T, 6% A alleles) to that in affected individuals with a family history (70% T,
30% A alleles).
Alternatively, suppose the SNP is an ancestral G,A polymorphism
found on a DrdI island near the APC gene (with allele frequencies of 0.5) which
predates the founder T -> A transversion. Suppose this event occurred in the A allele,
termed A*, and is in linkage disequilibrium, i.e. recombination has not altered the
ancestral haplotype (Lander, E. S., "The New Genomics: Global Views of Biology,"
Science, 274(5287):536-9 (1996) and Risch et al., "The Future of Genetic Studies of
Complex Human Diseases," Science, 273(5281):1516-7 (1996), which are hereby
incorporated by reference). Then, the allele frequencies are: G = 0.5, A = 0.44, and A*
= 0.06. Expanding the formula (p + q + r)^2 = 1 gives expected genotype frequencies
of GA = 0.44, GG = 0.25, AA = 0.19, GA* = 0.06, AA* = 0.05, and A*A* = 0.004.
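
The expansion of (p + q + r)^2 used above can be checked with a short sketch. The function name genotype_frequencies is hypothetical; the allele frequencies are those stated in the text, and the unrounded results (for example AA = 0.1936) round to the values quoted.

```python
from itertools import combinations_with_replacement


def genotype_frequencies(allele_freqs):
    """Expected genotype frequencies from expanding (p + q + r)^2.

    Homozygotes have frequency p*p; heterozygotes have frequency 2*p*q.
    """
    result = {}
    for a, b in combinations_with_replacement(list(allele_freqs), 2):
        product = allele_freqs[a] * allele_freqs[b]
        result[a + b] = product if a == b else 2 * product
    return result


freqs = genotype_frequencies({"G": 0.5, "A": 0.44, "A*": 0.06})
for genotype, value in freqs.items():
    print(genotype, round(value, 4))
```

The six genotype frequencies sum to 1, as required by the formula.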

To illustrate the predicted allele imbalance at this ancestral G,A
polymorphism, compare predicted allele frequencies in 1,000 normal individuals and
1,000 disease individuals with a family history of colon cancer. Then for the normals,
1,000 chromosomes will be scored as the G allele and 1,000 chromosomes will be
scored as the A allele (containing 880 "A" and 120 "A*"). Among the affected
individuals with a family history, approximately 30% (Laken, et al., "Familial
Colorectal Cancer in Ashkenazim Due to a Hypermutable Tract in APC," Nature
Genetics, 17(1):79-83 (1997), which is hereby incorporated by reference) or 300
individuals contain the A* allele (comprised of GA*, AA*, or A*A*) and the
remaining 70% or 700 individuals do not (comprised of GG, AA, or GA). The
number of individuals for each genotype is determined by the number of individuals
in category x expected genotype frequency / total of genotype frequency in category.
For example, the number of individuals with GA = 700 x 0.44 / 0.88 = 350. Other
values are: GG = 196; AA = 156; GA* = 159, AA* = 132, and A*A* = 9 (This
calculation assumes that A*A* has the same risk as AA*; the number is small enough
to be inconsequential). Summation of the number of each allele yields 350 + (196 x
2) + 159 = 901 G alleles and 350 + (156 x 2) + 159 + (132 x 2) + (9 x 2) = 1,099 A
alleles, or approximately a 45% G: 55% A allele imbalance. Observation of this
imbalance in 400 affected individuals (= 800 alleles) would have a p value of 0.005.

Thus, for isolated populations (e.g., Ashkenazi Jews), evaluation of
allele imbalance at ancestral polymorphisms by comparing unaffected with affected
individuals has the potential for identifying nearby genes with common
polymorphisms of low risk. Evaluation of multiple SNPs using PCR/LDR with DNA
array detection should aid this analysis. Since the SNP arrays are quantitative, it may
be possible to determine allele frequency from pooled DNA samples. Allele number
from 4 combined individuals may be calculated by quantifying allele ratios, i.e. ratio
of 1:1 = alleles of 4:4 for the two alleles; ratio of 1:1.67 = alleles of 3:5; ratio of 1:3 = alleles of
2:6; ratio of 1:7 = alleles of 1:7; and if one allele is absent then the other is present on
all 8 chromosomes represented in the pooled sample. Such ratios may be
distinguished using array detection, which would reduce the above experimental
analysis to evaluation of 100 pooled normal and 100 pooled affected samples.
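
The mapping from a measured allele ratio to allele counts in a pool of 4 individuals (8 chromosomes) can be sketched as below. The function name pooled_allele_counts and the rounding to the nearest whole chromosome are assumptions made for illustration; the ratios tested are the ones listed in the text.

```python
def pooled_allele_counts(signal_a, signal_b, individuals=4):
    """Estimate allele counts in a pooled diploid sample from two allele signals.

    The fraction of signal from allele a is scaled to the number of
    chromosomes in the pool and rounded to the nearest whole chromosome.
    """
    chromosomes = 2 * individuals
    count_a = round(chromosomes * signal_a / (signal_a + signal_b))
    return count_a, chromosomes - count_a


# Ratios listed in the text, plus the case where one allele is absent.
for a, b in [(1, 1), (1, 1.67), (1, 3), (1, 7), (0, 1)]:
    print(f"ratio {a}:{b} -> alleles {pooled_allele_counts(a, b)}")
```

With 8 chromosomes the possible ratios are well separated, which is why quantitative array detection should be able to distinguish them.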

A complete set of about 100,000 SNPs will place a SNP every 30 kb.
This would require 25 arrays of 4,096 addresses. When comparing association for
400 disease individuals with 400 normal controls, this would require 20,000 array
scans and provide the data on 80,000,000 SNPs in the population. PCR and LDR
reactions take 2 hours each, but may be done in parallel. The current scheme would
only require 20,000 PCR reactions, followed by 20,000 LDR reactions, and finally
20,000 DNA array hybridizations (1 hr), and scannings (a few minutes per array).
This is far more efficient than the current technology which evaluates one SNP at a
time.
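
The scale of this association study follows from simple arithmetic, sketched below. The function name and dictionary keys are hypothetical, and the genome size of 3 x 10^9 bp is an assumption not stated in this passage; the remaining inputs are taken from the text.

```python
import math


def association_study_plan(total_snps=100_000, addresses_per_array=4_096,
                           cases=400, controls=400, genome_bp=3_000_000_000):
    """Arithmetic behind the array counts quoted in the text.

    genome_bp is an assumed human genome size used only for SNP spacing.
    """
    arrays_per_individual = math.ceil(total_snps / addresses_per_array)
    individuals = cases + controls
    return {
        "snp_spacing_kb": genome_bp / total_snps / 1_000,
        "arrays_per_individual": arrays_per_individual,
        "array_scans": arrays_per_individual * individuals,
        "snp_genotypes": total_snps * individuals,
    }


plan = association_study_plan()
for key, value in plan.items():
    print(key, value)
```

Under these inputs each individual needs 25 arrays, and 800 individuals give 20,000 array scans and 80,000,000 SNP determinations.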

The SNP DNA array analysis simultaneously provides predicted
association for all the affected genes of any prevalent disease (e.g., Alzheimer's, heart
disease, cancer, diabetes). It will find both positive and negative modifier genes, it
will find genes with low penetrance increase for risk, and will map to within 30 kb of
all genes which influence the disease. This approach will allow for pinpointing
additional polymorphisms within the disease associated genes, opening the prospect
for customized treatments and therapies based on pharmacogenomics.
EXAMPLES

Example 1 - Demonstration of T4 DNA Ligase Fidelity in Ligating
Linker/Adapters to only their Complementary 2 base 3' Overhangs
Using Synthetic Targets.

Ligation reactions with T4 DNA ligase and a variety of linker/adapters
(GG-, AA-, AG-, and GA-) and synthetic targets (Tables 13 and 14) were performed
to determine the fidelity of T4 DNA ligase under various experimental conditions.

Table 13. DrdI and MspI/TaqI Bubble linkers and PCR primers for BAC clones

Primer        Sequence (5'->3')

BAA29         5' TAG ACT GCG TAC TCT AA 3' (SEQ. ID. No. 77)
BAA3034R      5' pA GAG TAC GCA GTC TAG GAG TCA GG 3' (SEQ. ID. No. 78)
BAAP31        5' CCT GAG TCG TAG ACT GCG TAC TCT AA 3' (SEQ. ID. No. 79)
BAAP32-FAM    5' FAM-CCT GAG TCG TAG ACT GCG TAC TCT AA 3' (SEQ. ID. No. 80)
BAC33         5' TAG ACT GCG TAC TCT AC 3' (SEQ. ID. No. 81)
BACP35        5' CCT GAG TCG TAG ACT GCG TAC TCT AC 3' (SEQ. ID. No. 82)
BACP36-FAM    5' FAM-CCT GAG TCG TAG ACT GCG TAC TCT AC 3' (SEQ. ID. No. 83)
BAG37         5' TAG ACT GCG TAC TCA AG 3' (SEQ. ID. No. 84)
BAG37b        5' Biotin-C18-ACT GAG TCG TAG ACT GCG TAC TCA AG 3' (SEQ. ID. No. 85)
BAG38R        5' pT GAG TAC GCA GTC TAC GAC TCA GT 3' (SEQ. ID. No. 86)
BAGP39        5' ACT GAG TCG TAG ACT GCG TAC TCA AG 3' (SEQ. ID. No. 87)
BAGP40-FAM    5' FAM-ACT GAG TCG TAG ACT GCG TAC TCA AG 3' (SEQ. ID. No. 88)
BCA41         5' TAG ACT GCG TAC TCT CA 3' (SEQ. ID. No. 89)
BCA41b        5' Biotin-C18-ACT GAG TCG TAG ACT GCG TAC TCT CA 3' (SEQ. ID. No. 90)
BCA4246R      5' pA GAG TAC GCA GTC TAC GAC TCA GT 3' (SEQ. ID. No. 91)
BCAP43        5' ACT GAG TCG TAG ACT GCG TAC TCT CA 3' (SEQ. ID. No. 92)
BCAP44-FAM    5' FAM-ACT GAG TCG TAG ACT GCG TAC TCT CA 3' (SEQ. ID. No. 93)
BGA45         5' TAG ACT GCG TAC TCT GA 3' (SEQ. ID. No. 94)
BGAP47        5' ACT GAG TCG TAG ACT GCG TAC TCT GA 3' (SEQ. ID. No. 95)
BGAP48-FAM    5' FAM-ACT GAG TCG TAG ACT GCG TAG TCT GA 3' (SEQ. ID. No. 96)
BGG49         5' TAG ACT GCG TAC TAT GG 3' (SEQ. ID. No. 97)
BGG50R        5' pA TAG TAC GCA GTC TAC GAC TCA GT 3' (SEQ. ID. No. 98)
BGGP51        5' ACT GAG TCG TAG ACT GCG TAC TAT GG 3' (SEQ. ID. No. 99)
BGGP52-FAM    5' FAM-ACT GAG TCG TAG ACT GCG TAC TAT GG 3' (SEQ. ID. No. 100)



Table 14. Targets for ligation experiments in synthetic system.

Primer        Sequence (5'->3')

L53FL         5' pCAT TCA GGA CCT GGA TTG GCG A-Fluorescein 3' (SEQ. ID. No. 101)
TT54R-FAM     5' Fam-TCG CCA ATC CAG GTC CTG AAT GTT 3' (SEQ. ID. No. 102)
CC55R-FAM     5' Fam-TCG CCA ATC CAG GTC CTG AAT GCC 3' (SEQ. ID. No. 103)
CT56-FAM      5' Fam-attaTCG CCA ATC CAG GTC CTG AAT GCT 3' (SEQ. ID. No. 104)
TC57-FAM      5' Fam-attaattaTCG CCA ATC CAG GTC CTG AAT GTC 3' (SEQ. ID. No. 105)

Synthetic targets were fluorescently labeled with Fam and were of different lengths, such
that correct perfect match ligations could be distinguished from unwanted mismatch
ligations when separating products on a sequencing gel. Reactions were performed in a 20 µL
volume in a modified T4 DNA ligase buffer (20 mM Tris-HCl (pH 7.5), 10 mM
MgCl2, 10 mM dithiothreitol, 1 mM dATP, and 2.5 µg/ml BSA) and contained 5 nM
ligation target. Products were separated on a denaturing polyacrylamide sequencing
gel and quantified using an ABI 373 automated sequencer and GENESCAN software.
The effect of T4 DNA ligase enzyme concentration (100 U or 400 U, New England
Biolabs units), KCl concentration (50 mM or 100 mM), linker/adapter concentration
(50 or 500 nM linker/adapter), temperature (15°C or 37°C), and time (1 hr or 16 hr)
on T4 ligase fidelity and activity was examined.

All of the reactions generated the correct ligation product with no
detectable misligation product (Figure 71). The total concentration of linker/adapter
and KCl concentration sometimes had an effect on overall activity. From these
assays, the optimal conditions for ligation reactions associated with the DrdI
representational approach were determined to be 100 U T4 DNA ligase (New England
Biolabs units), 500 nM linker/adapter, 50 mM KCl, 20 mM Tris-HCl (pH 7.5), 10
mM MgCl2, 10 mM dithiothreitol, 1 mM dATP, and 2.5 µg/ml BSA in a 20 µL
reaction incubated at 37°C for 1 h. This condition is the preferred condition, because
it is compatible with the restriction enzymes used to generate DrdI representations.
Although this condition is optimal for T4 DNA ligase, detectable activity was
observed under all of the tested combinations of parameters listed above. For other
linker/adapter sequences or restriction enzyme overhangs, conditions may be
optimized using this assay.



Example 2 - Demonstration of Restriction Digestion and Specific Ligation of
Linker/Adapters to their Complementary Overhangs Followed by
PCR Amplification of the Correct Fragment.

Specificity and reproducibility of DrdI Restriction/Ligation/PCR were
tested in two vectors (pBeloBAC11 and pBACe3.6) and a BAC clone. BAC DNA (5-10
ng) was digested with DrdI, MspI, and TaqI and, simultaneously, ligated with 500
nM of the appropriate linker/adapters in the presence of T4 DNA ligase.
Linker/adapters containing 2 base 3' overhangs complementary to the DrdI site
(BAA29 + BAA3034R for AA overhangs, BAC33 + BAA3034R for AC overhangs,
BAG37 + BAG38R for AG overhangs, BCA41 + BCA4246R for CA overhangs,
BGA45 + BCA4246R for GA overhangs, and BGG49 + BGG50R for GG overhangs)
are listed in Table 13. Linker/adapters containing 2 base 5' overhangs
complementary to the CG overhang of MspI or TaqI sites (MTCG225 + MTCG0326R
or MTCGp326R) are listed in Table 8. The MTCG225/MTCG0326R and
MTCG225/MTCGp326R linker adapters contain a bubble to avoid unwanted MspI-MspI,
TaqI-MspI, or TaqI-TaqI fragment amplifications. This digestion/ligation
reaction was performed in a buffer containing 20 mM Tris-HCl (pH 7.5), 10 mM
MgCl2, 50 mM KCl, 10 mM dithiothreitol, 1 mM dATP, and 2.5 µg/ml BSA.
Reactions were incubated at 37°C for one hour followed by an 80°C incubation for 20
min in order to heat inactivate the enzymes. Since TaqI is a thermophilic enzyme, 10-fold
more units were used to counterbalance the 10-fold lower activity at 37°C. This
enzyme is fully inactivated by the above heating step.
To remove fragments and linkers with sizes smaller than 100 bp, the
digestion/ligation reaction was microcentrifuged with an Amicon YM-50. First, the
sample was centrifuged at 8000 rpm for 8 min, then the filter was inverted and the
desired products were recovered by centrifuging at 6000 rpm for 3 min. After
recovery, the sample volume was brought up to 20 µL with ddH2O for PCR
amplification.

PCR reactions contained the YM-50 purified digestion/ligation
reaction (20 µl), 1x PCR buffer (10 mM Tris-HCl (pH 8.3), 50 mM KCl), 4 mM
MgCl2, 0.4 mM dNTPs, 1.25 U AmpliTaq Gold, and 0.5 µM PCR primers in a 50 µl
reaction. The PCR reactions were initially incubated at 95°C for 10 min (to activate
AmpliTaq Gold polymerase) followed by 35 cycles of 94°C, 15 sec; 65°C, 2 min.

Assays performed with pBeloBAC11 or pBACe3.6 resulted in even
amplification of 2 fragments for GA- overhangs and 1 fragment each for AA- or CA-
overhangs as predicted based on the presence of these overhangs in the plasmids.
Similar assays were performed with BAC RG253B13 and also generated the expected
results (2 fragments for GA- overhangs and 3 fragments for AA- overhangs
respectively, see Figure 46). The larger 3,419 bp GA fragment was not observed,
because it was not expected to be amplified. These results demonstrate that the
restriction digestion was sufficiently complete and the ligation and PCR reactions
were specific for the desired products.

Example 3 - Suppression of Amplification of Vector Derived Sequence while
Amplifying the Correct Fragment.

The PCR amplification of DrdI fragments derived from the vector
sequence was suppressed using PNA or propynyl clamping oligos. A slightly
modified protocol was used when PCR amplifying DrdI fragments containing AA,
CA, or GA overhangs from BACs derived from the pBeloBAC11 or pBACe3.6
vector. The pBeloBAC11 and pBACe3.6 vectors both contain DrdI sites
complementary to AA-, CA-, and GA- overhangs, and amplification of these vector
fragments needed to be suppressed. Clamping oligos which bind specific DrdI
fragments (i.e. vector derived) and block annealing of PCR primers were designed as
PNA or propynyl derivatives (Tables 5 and 6).

BAC DNA (5-10 ng) was digested with DrdI, MspI, and TaqI and
simultaneously ligated with 500 nM of the appropriate linker/adapters in the presence
of T4 DNA ligase in a buffer containing 20 mM Tris-HCl (pH 7.5), 10 mM MgCl2, 50
mM KCl, 10 mM dithiothreitol, 1 mM dATP, and 2.5 µg/ml BSA. Reactions were
incubated at 37°C for one hour followed by an 80°C incubation for 20 min in order to
heat inactivate the enzymes. Fragments and excess linker/adapter less than 100 bp
were removed by ultrafiltration on Amicon YM-50 filters as described above. PCR
reactions contained the YM-50 purified digestion/ligation reaction (20 µl), 1x PCR
buffer (10 mM Tris-HCl (pH 8.3), 50 mM KCl), 4 mM MgCl2, 0.4 mM dNTPs, 1.25
U AmpliTaq Gold, 1 µM of clamping oligos, and 0.5 µM PCR primers in a 50 µl
reaction. The PCR reactions were initially incubated at 95°C for 10 min (to activate
AmpliTaq Gold polymerase) followed by 35 cycles of 94°C, 15 sec; 65°C, 2 min.
DrdI Restriction/Ligation/PCR assays were performed with pBACe3.6 and 1 µM
clamping oligos. In one reaction, AA- linker/adapters were ligated to digested
vector. This sample was PCR amplified in the presence of an AA- clamping oligo
specific for suppressing amplification of the AA-DrdI fragment associated with only the
vector sequence. No vector derived PCR product was observed with either the PNA
or the propynyl clamping oligos. In a subsequent experiment, GA- and AA-
linker/adapters were present simultaneously in the digestion/ligation reaction of
pBACe3.6. This reaction was then PCR amplified in the presence of 1 µM AA-
clamping oligo (either PNA or propynyl derivative). No AA-product was observed
with either the PNA or the propynyl clamping oligo, but the amplification of the GA-
fragment was unaffected by the presence of the AA- clamp. Similar assays were
performed with BAC RG253B13 and also generated the expected number of
amplified fragments, depending on which clamps were being used. These results
demonstrate the ability of PNA or propynyl clamping oligos to specifically suppress
amplification of an undesired fragment, while having no measurable effect on the
amplification of desired fragments.
Example 4 - Enrichment of DrdI Representational Fragments Using
Biotinylated Linker/Adapters and Streptavidin Purification.

Creation of a library of representational fragments is required to
rapidly sequence those fragments and discover SNPs. While a PCR amplification
reaction may enrich for a particular representation, there also is the possibility of
generating false SNPs through polymerase error. An approach to minimizing false
SNPs is to pre-select the representational fragments, and/or avoid amplification
altogether. This may be achieved by using biotinylated linker/adapters to a specific
DrdI overhang, followed by purification of only those fragments using streptavidin
beads.

While genomic DNA will ultimately be used for this task, BAC DNA
was used in this example since proof of the correct selection is easily achieved by
demonstrating that the correct fragments amplified. BAC DNA (5-10 ng) was
digested with DrdI, MspI, and TaqI and simultaneously ligated with 500 nM of the
appropriate linker/adapters in the presence of T4 DNA ligase. Linker/adapters
containing 2 base 3' overhangs complementary to the DrdI site (BAG37b + BAG38R
for AG overhangs and BCA41b + BCA4246R for CA overhangs) are listed in
Table 13. Linker/adapters containing 2 base 5' overhangs complementary to the CG
overhang of MspI or TaqI sites (MTCG225 + MTCG0326R or MTCGp326R) are
listed in Table 8. The MTCG225/MTCG0326R and MTCG225/MTCGp326R linker
adapters contain a bubble to avoid unwanted MspI-MspI, TaqI-MspI, or TaqI-TaqI
fragment amplifications. This digestion/ligation reaction was performed in a buffer
containing 20 mM Tris-HCl (pH 7.5), 10 mM MgCl2, 50 mM KCl, 10 mM
dithiothreitol, 1 mM dATP, and 2.5 µg/ml BSA. Reactions were incubated at 37°C
for one hour followed by an 80°C incubation for 20 min in order to heat inactivate the
enzymes. Fragments and excess linker/adapter less than 100 bp were removed by
ultrafiltration on Amicon YM-50 filters as described above.

The purification procedure was as follows (streptavidin magnetic
beads and the purification protocol were obtained from Boehringer Mannheim,
Indianapolis, Indiana): 10 µl of (10 µg/µl) magnetic beads were washed three times
with binding buffer TEN100 (10 mM Tris-HCl (pH 7.5), 1 mM EDTA, 100 mM NaCl).
The sample (YM-50 purified digestion/ligation reaction) volume was brought up to
100 µl in binding buffer and incubated with washed beads for 30 min (constantly
shaking using a nutator or rotating platform). The pellet was washed 2 times with
TEN1000 (10 mM Tris-HCl (pH 7.5), 1 mM EDTA, 1000 mM NaCl) and then washed
once in 1x PCR buffer (10 mM Tris-HCl (pH 8.3), 50 mM KCl, 4 mM MgCl2). The
sample was eluted in 30 µl 1x PCR buffer by incubating at 95°C for 5 min, capturing
the beads in the magnetic stand for 30 sec at 95°C, followed by immediate removal of
the supernatant at the bench. After the streptavidin purification, dNTPs (0.4 mM final
concentration), PCR primers (0.5 µM final) and ddH2O were added to the purified
sample to increase the volume to 50 µl. AmpliTaq Gold (1.25 U) was added, with PCR
reactions initially incubated at 95°C for 10 min (to activate AmpliTaq Gold
polymerase), followed by 35 cycles of 94°C, 15 sec; 65°C, 2 min.

In assays with pBACe3.6, biotinylated CA- linker/adapters, and non-
biotinylated AA- linker/adapters, streptavidin purification resulted in only the CA-
linker fragment being PCR amplified. Conversely, both CA- and AA- linker
fragments were amplified in the control assay without the streptavidin purification
step. This result demonstrates that streptavidin purification can be utilized to enrich
for specific linker/adapter products prior to the PCR amplification.


Example 5 - Amplification of DrdI Representations from the S. cerevisiae
Genome.

The S. cerevisiae genome (16 Mb) was chosen as a
more complex model system than individual BACs, but still at 1/200th the complexity
of the human genome. 100 ng of S. cerevisiae genomic DNA was subjected to the
same protocol as the BAC DNA as described above. Digestion/ligation reactions
were PCR amplified using 7 separate primers with either 2 or 3 base selectivity (AC,
CA, GA, AG, GG, CAG, and CAT). A fragment appeared as a band above
background in the CA- representation, suggesting the presence of a repetitive element.
This band was 2- to 4-fold stronger in the CAG representation, yet absent in the CAT
representation. This indicates that PCR primers can also be utilized to alter the size
and complexity of a representation. Inclusion of a size filtration step (Amicon YM-50)
before PCR amplification resulted in amplification of a broader representation
(based on size) as assayed on an agarose gel.
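
The effect of primer selectivity on the complexity of a representation can be illustrated with a short sketch. It assumes that each ligated fragment is summarized by the bases that follow the linker/adapter (the overhang first, then the adjacent genomic bases). The function name, fragment labels, and sequences are hypothetical and are not data from this example.

```python
def select_fragments(fragment_ends, selective_bases):
    """Return the names of fragments whose adapter-adjacent bases begin
    with the primer's 3' selective bases.

    fragment_ends maps a fragment name to the bases that follow the
    linker/adapter sequence (overhang first, then genomic bases).
    """
    wanted = selective_bases.upper()
    return [
        name
        for name, end in fragment_ends.items()
        if end.upper().startswith(wanted)
    ]


# Hypothetical fragment ends, for illustration only.
fragment_ends = {
    "frag1": "CAGT",
    "frag2": "CATG",
    "frag3": "CAAC",
    "frag4": "AGAT",
}

ca_rep = select_fragments(fragment_ends, "CA")    # 2 base selectivity
cag_rep = select_fragments(fragment_ends, "CAG")  # 3 base selectivity
cat_rep = select_fragments(fragment_ends, "CAT")  # 3 base selectivity

print("CA :", ca_rep)
print("CAG:", cag_rep)
print("CAT:", cat_rep)
```

In this toy set, each 3 base primer retains a subset of the CA- representation, which mirrors the observation that a band present in the CA- representation was enriched with the CAG primer and absent with the CAT primer.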

Example 6 - Amplification of DrdI Representations from the Human Genome.

Human DNA has a complexity of 3,500 Mb, and is predicted to
contain about 300,000 DrdI sites. A DrdI representation using three bases of
selectivity should amplify about 8,750 fragments, yielding about 0.2% of the genome.
A DrdI representation using four bases of selectivity should amplify about 2,200
fragments, yielding about 0.05% of the genome. 100 ng of human genomic DNA
obtained from Boehringer-Mannheim was digested with 10 U DrdI, 20 U MspI, and
100 U TaqI and simultaneously ligated with 500 nM of the appropriate DrdI
linker/adapter and 1,000 nM of the MspI/TaqI linker/adapter in the presence of T4
DNA ligase. Linker/adapters containing 2 base 3' overhangs complementary to the
DrdI site (BAG37 + BAG38R for AG overhangs, and BCA41 + BCA4246R for CA
overhangs) are listed in Table 13. Linker/adapters containing 2 base 5' overhangs
complementary to the CG overhang of MspI or TaqI sites (MTCG225 +
MTCG0326R) are listed in Table 8. This digestion/ligation reaction was performed in
a buffer containing 20 mM Tris-HCl (pH 7.5), 10 mM MgCl2, 50 mM KCl, 10 mM
dithiothreitol, 1 mM dATP, and 2.5 µg/ml BSA. Reactions were incubated at 37°C
for one hour followed by an 80°C incubation for 20 min in order to heat inactivate the
enzymes. Fragments and excess linker/adapter less than 100 bp were removed by
ultrafiltration on Amicon YM50 filters as described above.
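
The fragment-count estimates at the start of this example can be checked with a few lines of arithmetic. The sketch below starts from the stated figures (3,500 Mb genome, about 8,750 fragments with three bases of selectivity) and makes two assumptions that are not stated in the text: that each additional selective base retains roughly one fragment in four, and that the mean amplified fragment is about 800 bp, a value chosen because it reproduces the stated percentages.

```python
GENOME_BP = 3_500_000_000   # human genome complexity stated in the text (3,500 Mb)
FRAGMENTS_3_BASE = 8_750    # fragments expected with three bases of selectivity
MEAN_FRAGMENT_BP = 800      # ASSUMPTION: mean fragment length, not given in the text


def fragments_after_extra_bases(n_fragments, extra_bases):
    """Estimate fragments left after adding selective bases.

    ASSUMPTION: the four bases are equally likely at each position, so
    every extra selective base keeps about one fragment in four.
    """
    return n_fragments / 4 ** extra_bases


def genome_fraction(n_fragments, mean_fragment_bp=MEAN_FRAGMENT_BP,
                    genome_bp=GENOME_BP):
    """Fraction of the genome covered by the amplified fragments."""
    return n_fragments * mean_fragment_bp / genome_bp


fragments_4_base = fragments_after_extra_bases(FRAGMENTS_3_BASE, 1)
fraction_3_base = genome_fraction(FRAGMENTS_3_BASE)
fraction_4_base = genome_fraction(fragments_4_base)

print(f"4 base selectivity: about {fragments_4_base:.0f} fragments")
print(f"3 base selectivity: {fraction_3_base:.2%} of the genome")
print(f"4 base selectivity: {fraction_4_base:.2%} of the genome")
```

Under these assumptions the four-base estimate comes to roughly 2,190 fragments, consistent with the "about 2,200" quoted above, and the two representations cover about 0.2% and 0.05% of the genome.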

PCR reactions contained the YM-50 purified digestion/ligation
reaction (20 µl), 1x PCR buffer (10 mM Tris-HCl (pH 8.3), 50 mM KCl), 4 mM
MgCl2, 0.4 mM dNTPs, 1.25 U AmpliTaqGold, and 0.5 nM PCR primers in a 100 µl
reaction. The PCR primer on the MspI/TaqI side was MTCG228 and is listed in
Table 8. The PCR primers on the DrdI side were complementary to the
linker/adapter, and had either 3 or 4 bases of specificity (e.g. primer CATP58 = 3 base
CAT specificity, primer CAGP59 = 3 base CAG specificity, primer AGAP60 = 3 base
AGA specificity, primer AGCP61 = 3 base AGC specificity, primer AGATP62 = 4
base AGAT specificity, primer AGAGP63 = 4 base AGAG specificity, primer
CATGP64 = 4 base CATG specificity, and primer CAGTP65 = 4 base CAGT
specificity) and are listed in Table 15.

Table 15. PCR primers for representational PCR/LDR/Arrays.

Primer     Sequence (5'->3')

CATP58     5' CT GAG TCG TAG ACT GCG TAC TCT CAT 3' (SEQ. ID. No. 106)
CAGP59     5' CT GAG TCG TAG ACT GCG TAC TCT CAG 3' (SEQ. ID. No. 107)
AGAP60     5' CT GAG TCG TAG ACT GCG TAC TCA AGA 3' (SEQ. ID. No. 108)
AGCP61     5' CT GAG TCG TAG ACT GCG TAC TCA AGC 3' (SEQ. ID. No. 109)
AGATP62    5' CT GAG TCG TAG ACT GCG TAC TCA AGA T 3' (SEQ. ID. No. 110)
AGAGP63    5' CT GAG TCG TAG ACT GCG TAC TCA AGA G 3' (SEQ. ID. No. 111)
CATGP64    5' CT GAG TCG TAG ACT GCG TAC TCT CAT G 3' (SEQ. ID. No. 112)
CAGTP65    5' CT GAG TCG TAG ACT GCG TAC TCT CAG T 3' (SEQ. ID. No. 113)



The "regular PCR" reactions were initially incubated at 95°C for 10 min (to activate
AmpliTaq Gold polymerase) followed by 35 cycles of 94°C, 15 sec; 65°C, 2 min.
Another set of PCR conditions, called "touchdown PCR", was tested in addition to the
"regular PCR" described previously. The "touchdown PCR" protocol consisted of
heating for 10 min at 95°C followed by 8 cycles of denaturing for 15 sec at 94°C and
annealing/extension for 2 min at 72°C. The annealing/extension temperature was
reduced 1°C for each cycle until a final temperature of 64°C. Another 30 cycles of
PCR were performed with denaturing for 15 sec at 94°C and annealing/extension for 2
min at 64°C. Each sample was run in quadruplicate, and the 400 µl of PCR
products were pooled and concentrated by ultrafiltration on Amicon YM50 filters as
described above. Final samples were brought up in 20 µl TE.
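
The touchdown cycling program can be written out cycle by cycle. The sketch below is one reading of the protocol, under the assumption that the 8 touchdown cycles run at 72°C, 71°C, and so on down to 65°C, after which the 30 remaining cycles run at 64°C. The function name and tuple layout are illustrative only, and the initial 10 min activation at 95°C is not included.

```python
def touchdown_schedule(start_anneal=72, final_anneal=64, final_cycles=30,
                       denature_temp=94, denature_s=15, anneal_s=120):
    """Return one (denature_C, denature_s, anneal_C, anneal_s) tuple per cycle.

    ASSUMPTION: the annealing/extension temperature drops 1 degree C per
    cycle from start_anneal down to final_anneal + 1, and the remaining
    cycles are all run at final_anneal.
    """
    cycles = []
    # Touchdown phase: 72, 71, ..., 65 (8 cycles with the defaults).
    for anneal_temp in range(start_anneal, final_anneal, -1):
        cycles.append((denature_temp, denature_s, anneal_temp, anneal_s))
    # Constant phase: 30 cycles at the final annealing temperature.
    for _ in range(final_cycles):
        cycles.append((denature_temp, denature_s, final_anneal, anneal_s))
    return cycles


schedule = touchdown_schedule()
for number, (d_temp, d_s, a_temp, a_s) in enumerate(schedule[:10], start=1):
    print(f"cycle {number:2d}: {d_temp}C {d_s}s, {a_temp}C {a_s}s")
print("total cycles:", len(schedule))
```

With the default values this gives 38 cycles in total, compared with 35 cycles for the "regular PCR" program.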

PCR amplification of human genome representations (CA- or AG-
linker/adapters) was performed with a variety of 3 and 4 base selection primers (e.g.,
CAG, CAT, CAGT, CATG, AGC, AGA, AGAT, and AGAG). The agarose gel
analysis demonstrated apparently equal and broad representation for each of the above
PCR primers (Figure 72).

To verify that these human genomic DrdI representations were
selecting the appropriate fragments, LDR assays were performed to probe for specific
fragments within a given representation. LDR conditions used 4 µl of the
concentrated representational fragments from the above mentioned PCR reactions, 1x
Tth DNA ligase buffer (20 mM Tris-HCl pH 8.5, 5 mM MgCl2, 100 mM KCl, 1 mM
DTT, 1.25 mM NAD+), and 2.5 nM LDR probes. Tth DNA ligase (in buffer containing
10 mM Tris-HCl pH 8.0, 1 mM EDTA, 1 mg/ml BSA) was added to the reaction to a
final concentration of 5 nM. The LDR reaction was carried out with 20 cycles of
heating at 95°C for 15 sec and ligation at 64°C for 2 min. Three microliters of the
LDR reaction product was loaded on the gel and the gel image was read by GeneScan
Analysis 2.02. Control assays containing PCR products generated from primers
(Tables 16 and 17) designed for each of the targeted regions demonstrated the
integrity of LDR assays (Figure 73).



Table 16. Primers Designed for Detection of Polymorphisms Near DrdI Sites
by PCR/LDR.

Primer                 Sequence (5'->3')

Uni A primer           GGAGCACGCTATCCCGTTAGAC (SEQ. ID. No. 114)
Uni B2 primer          CGCTGCCAACTACCGCACATC (SEQ. ID. No. 115)

B13 AGA fp1            GGAGCACGCTATCCCGTTAGACCCCTGCAATGACTCCCCATTTC (SEQ. ID. No. 116)
B13 AGA rp1            CGCTGCCAACTACCGCACATCAGTAGGGCTGGGGCATCAGAAC (SEQ. ID. No. 117)
B13 AGA Fam1 (F-1)     Fam aGCTTCAGACACACCAGGCAC =47 (SEQ. ID. No. 118)
B13 AGA -Com1 (C-1)    pATTTAGTTCTTCCTTCTTGCCTCTGC-Bk (SEQ. ID. No. 119)

B13 AGC fp2            GGAGCACGCTATCCCGTTAGACATTGTGGAAGACAGTGTGGTGATTC (SEQ. ID. No. 120)
B13 AGC rp2            CGCTGCCAACTACCGCACATCCATGGCATATATGTGCCACATTTTC (SEQ. ID. No. 121)
B13 AGC Fam2 (F-2)     FamAAGCATGCTGCTGTAAAGACACA =52C (SEQ. ID. No. 122)
B13 AGC -Com2 (C-2)    pTGCACATGTATGTTTATTGCAGCACTATT-Bk (SEQ. ID. No. 123)

E19 AGC fp3            GGAGCACGCTATCCCGTTAGACGTGTTAGCCAGGATGGTCTCCATC (SEQ. ID. No. 124)
E19 AGC rp3            CGCTGCCAACTACCGCACATCCATGGGTGGGGT7VACAGAAAGAAAC (SEQ. ID. No. 125)
E19 AGC Fam3 (F-3)     FamGACAATTATCCTGATTTGGGACC =48C (SEQ. ID. No. 126)
E19 AGC -Com3 (C-3)    pTTACCTTCAGATGGTTTTCCCTCCT-Bk (SEQ. ID. No. 127)

C03 AGA fp4            GGAGCACGCTATCCCGTTAGACTAGTGTCTAGGGATAGAGGAGAAC (SEQ. ID. No. 128)
C03 AGA rp4            CGCTGCCAACTACCGCACATCCTCCTGACATTATGGAGAGCCTTAC (SEQ. ID. No. 129)
C03 AGA Fam4 (F-4)     FamAATGCCACACTTCAGATTTTGATAC =50 (SEQ. ID. No. 130)
C03 AGA -Com4 (C-4)    pTTGCAGGATCCTATTTCTGGCACTA-Bk (SEQ. ID. No. 131)

Uni A primer           GGAGCACGCTATCCCGTTAGAC (SEQ. ID. No. 132)
Uni B2 primer          CGCTGCCAACTACCGCACATC (SEQ. ID. No. 133)

P20 AGA fp5            GGAGCACGCTATCCCGTTAGACGGACTTCTCCCCACTACAACATAGATTC (SEQ. ID. No. 134)
P20 AGA rp5            CGCTGCCAACTACCGCACATCTTTATCAGCAACATGAAAACAGACTAAC (SEQ. ID. No. 135)
P20 AGA Fam5 (F-5)     FamTGTGGAATTTATCATTTAATTTAGCTTC =56 (SEQ. ID. No. 136)
P20 AGA -Com5 (C-5)    pAGTGAACCGTTCTTTCCAGATTATTTTG-Bk (SEQ. ID. No. 137)

K23 AGA fp6            GGAGCACGCTATCCCGTTAGACAGAATAGAATGCTTGCAATTGATCAC (SEQ. ID. No. 138)
K23 AGA rp6            CGCTGCCAACTACCGCACATCATGTCAATTTGTTGGGGTTATACAAC (SEQ. ID. No. 139)
K23 AGA Fam6 (F-6)     Fam aaaaAGGAGGGTGACAGTGAACCTG =53 (SEQ. ID. No. 140)
K23 AGA -Com6 (C-6)    pGAGGTAAAATTCAACAATTCATTTGCTT-Bk (SEQ. ID. No. 141)

J17 AGA fp7            GGAGCACGCTATCCCGTTAGACGTGCAGACAAGAGAATGTCAAGTTTC (SEQ. ID. No. 142)
J17 AGA rp7            CGCTGCCAACTACCGCACATCAGAGGCTGGAAAAATAAATCCAATACA (SEQ. ID. No. 143)
J17 AGA Fam7 (F-7)     FamGATCAGAAACCACAGGAAATTTG =44 (SEQ. ID. No. 144)
J17 AGA -Com7 (C-7)    pATTTATGCCAGCCCTGCATCCC-Bk (SEQ. ID. No. 145)

AGATP62                CTGAGTCGTAGACTGCGTACTCTAGAT (SEQ. ID. No. 146)
AGAGP63                CTGAGTCGTAGACTGCGTACTCTAGAG (SEQ. ID. No. 147)
CATGP64                CTGAGTCGTAGACTGCGTACTCTCATG (SEQ. ID. No. 148)
CAGTP65                CTGAGTCGTAGACTGCGTACTCTCAGT (SEQ. ID. No. 149)



Table 17. Primers designed for detection of polymorphisms near DrdI sites by
PCR/LDR/Array Hybridization.

Primer                 Sequence (5'->3')

Uni A primer           GGAGCACGCTATCCCGTTAGAC (SEQ. ID. No. 150)
Uni B2 primer          CGCTGCCAACTACCGCACATC (SEQ. ID. No. 151)

GS056H18.2 forward     GGAGCACGCTATCCCGTTAGACGATGAGCTTACACAGGCACTGATTAC (SEQ. ID. No. 152)
GS056H18.2 reverse     CGCTGCCAACTACCGCACATCTATTGGTGACTGATGAAAATGTCAAAC (SEQ. ID. No. 153)
GS056H18.2             Fam-tGTCAAGAAAGTGTATTTAGCTTACAAAC =58 (SEQ. ID. No. 154)
GS056H18.2 -Com2       pTATTAACAGCCTGTTTTACCCTACTTTTG-Bk (SEQ. ID. No. 155)

RG083J23 forward       GGAGCACGCTATCCCGTTAGACGCACCTTATCTTGGCTTTTCTATTC (SEQ. ID. No. 156)
RG083J23 reverse       CGCTGCCAACTACCGCACATCAAGCATATTACATCATGTCATCACTTC (SEQ. ID. No. 157)
RG083J23               Fam-TTCGTTTCTCTTTATCCACACC =52 (SEQ. ID. No. 158)
RG083J23 -Com3         pATGGGAAATGTCTTTTACAATGTACATAAC-Bk (SEQ. ID. No. 159)

RG103H13 forward       GGAGCACGCTATCCCGTTAGACCAGCCATGTGATTCCCTGTGTAC (SEQ. ID. No. 160)
RG103H13 reverse       CGCTGCCAACTACCGCACATCCTGCATTGTACAATGCATGCATAC (SEQ. ID. No. 161)
RG103H13               Fam-aaatataaACTAAATGAATC/^GATAGAGTGAATG =60 (SEQ. ID. No. 162)
RG103H13-Com4          pTATGCATGCATTGTACAATGCAGG-Bk (SEQ. ID. No. 163)

RG103H13.2 forward     GGAGCACGCTATCCCGTTAGACTTCTGATAGAGTCGTTTTGTGCTTC (SEQ. ID. No. 164)
RG103H13.2 reverse     CGCTGCCAACTACCGCACATCCATTTTAGGATCTGGGAAGCATTAC (SEQ. ID. No. 165)
RG103H13.2             Fam-TTTTTCCTCCCATCCAAATTC =46 (SEQ. ID. No. 166)
RG103H13.2-Com5        pAGAGACCCTAGAATTCTAGCGATGG-Bk (SEQ. ID. No. 167)

Uni A primer           GGAGCACGCTATCCCGTTAGAC (SEQ. ID. No. 168)
Uni B2 primer          CGCTGCCAACTACCGCACATC (SEQ. ID. No. 169)

RG118D07 forward       GGAGCACGCTATCCCGTTAGACCCTTGGAAAGCAGGTGCAAATC (SEQ. ID. No. 170)
RG118D07 reverse       CGCTGCCAACTACCGCACATCAAATAACAACTGCATTACTCCATCATC (SEQ. ID. No. 171)
RG118D07               Fam-aaTGAAAAAATCCAATATTGGTCTG =55 (SEQ. ID. No. 172)
RG118D07 Com6          pTGTGTGAAAGTGTAAATGTATACGTGTATG-Bk (SEQ. ID. No. 173)

RG343P13 forward       GGAGCACGCTATCCCGTTAGACCTGTCAAGCAGGGAATTGGATAC (SEQ. ID. No. 174)
RG343P13 reverse       CGCTGCCAACTACCGCACATCCCTTTCTGATTTCAGTTGCTAGTTTC (SEQ. ID. No. 175)
RG343P13               Fam-GAGACCAAACCAGGGAGAAAG =50 (SEQ. ID. No. 176)
RG343P13-Com-7         pTACAGAGAGAGAGCAAAGAGAGTTCAGAC-Bk (SEQ. ID. No. 177)

RG363E19.2 forward     GGAGCACGCTATCCCGTTAGACTGGAGGTCCTAGCCAGAGCAAC (SEQ. ID. No. 178)
RG363E19.2 reverse     CGCTGCCAACTACCGCACATCGGTATTGCCTTTCTGATTTAGCTTTC (SEQ. ID. No. 179)
RG363E19.2             Fam-aGCCCAAAAGCTCCTTCAGC =48 (SEQ. ID. No. 180)
RG363E19.2-Com-9       pTGATAAACAACTTCAGCAAAGTTTCAGG-Bk (SEQ. ID. No. 181)

In addition, these control PCR products were diluted up to 10,000-fold
into 10 salmon sperm DNA. Even in this vast excess of noncomplementary DNA, 
LDR assays still identified the desired products. 

The targeted DrdI-MspI/TaqI fragments ranged in size from 130 to
1,500 bp and were derived from AG- or CA- linker/adapters. LDR assays of the
human representational libraries demonstrated that the representations were even and
that increasing base reach-in generated a more specific library (Figures 74 and 75).
This result demonstrates that LDR is sensitive enough to identify a specific DrdI-
MspI/TaqI fragment within a given representation.
Altering the PCR conditions to "touchdown" amplification resulted in
more LDR product with no apparent change in the relative distribution of fragments.
These results demonstrated that the DrdI representational approach was able to
generate an even and specific representation of the human genome.

Although the invention has been described in detail for the purpose of
illustration, it is understood that such detail is solely for that purpose, and variations
can be made therein by those skilled in the art without departing from the spirit and
scope of the invention which is defined by the following claims.
WHAT IS CLAIMED:

1. A method of assembling genomic maps of an organism's DNA
or portions thereof comprising:
providing a library of an organism's DNA, wherein individual
genomic segments or sequences are found on more than one clone in the library;
creating representations of the genome;
generating nucleic acid sequence information from the
representations;
analyzing the sequence information to determine clone overlap
from a representation; and
combining clone overlap and sequence information from
different representations to assemble a genomic map of the organism.

2. A method according to claim 1, wherein said creating
representations of the genome comprises:

creating a representation of the genomic segments in individual 
clones by selecting a subpopulation of genomic segments out of a larger set of the 
genomic segments in that clone. 


3. A method according to claim 2, wherein said selecting a 
subpopulation of genomic segments comprises: 

subjecting an individual clone to a first restriction endonuclease
under conditions effective to cleave DNA from the individual clone so that a
degenerate overhang is created in the clone; and
adding non-palindromic complementary linker adapters to the
overhangs in the presence of ligase and the first restriction endonuclease to select or
amplify particular fragments from the first restriction endonuclease digested clone as
a representation, whereby sufficient linker-genomic fragment products are formed to
allow determination of a DNA sequence adjacent the overhang.
4. A method according to claim 3, wherein the first restriction
endonuclease creates 2 base degenerate overhangs in the clone and 1 to 12 non-
palindromic linker adapters, which contain single stranded overhangs of the formula
NN/N'N' where NN/N'N' is selected from the group consisting of AA/TT, AC/GT,
AG/CT, CA/TG, GA/TC, and GG/CC, are used.

5. A method according to claim 4, wherein 4 to 6 non-palindromic
adapters are used.

6. A method according to claim 3, wherein the first restriction
endonuclease creates 3 base degenerate overhangs in the clone and 1 to 16 non-
palindromic complementary linker adapters, which contain single stranded overhangs
of the formulae NAA, NAC, NAG, NAT, NCA, NCC, NCG, NCT, NGA, NGC,
NGG, NGT, NTA, NTC, NTG, and NTT, with N being any nucleotide, are used.


7. A method according to claim 6, wherein 5 to 9 non-palindromic 
linker adapters are used. 

8. A method according to claim 3, wherein the first restriction
endonuclease is selected from the group consisting of DrdI, BglI, DraIII, AlwNI,
PflMI, AccI, BsiHKAI, SanDI, SexAI, PpuI, AvaII, EcoO109, Bsu36I, BsrDI, BsgI,
BpmI, SapI, and isoschizomers thereof.

9. A method according to claim 3, wherein said generating nucleic
acid sequence information from the representations comprising:

using sequencing primers to obtain sequence information from 
the ends of a subpopulation of genomic segments out of a larger set of genomic 
segments. 

10. A method according to claim 9, wherein the sequencing
primers have a 5' sequence that is complementary to the adapter primers and have a 3'
sequence that is complementary to two or more bases in the degenerate overhang
and/or adjacent to the restriction site recognition sequence to obtain sequencing
information adjacent to the restriction site.

11. A method according to claim 10, wherein 1 to 12 sequencing
primers are used with a 3' end from the set which end in NN, with N being any
nucleotide, and/or its complement N'N'.

12. A method according to claim 3, wherein 1 to 16 sequencing
primers are used with a 3' end from the set which end in NAA, NAC, NAG, NAT,
NCA, NCC, NCG, NCT, NGA, NGC, NGG, NGT, NTA, NTC, NTG, and NTT, with
N being any nucleotide.

13. A method according to claim 2, wherein said selecting a 
subpopulation of genomic segments comprises: 

subjecting an individual clone to a first restriction endonuclease
under conditions effective to cleave DNA from the individual clone so that a
palindromic overhang is created in the clone; and
adding complementary linker adapters to the overhangs in the
presence of ligase and the first restriction endonuclease to amplify particular fragments
from the first restriction endonuclease digested clone as a representation whereby
sufficient linker-genomic fragment products are formed to allow determination of a
DNA sequence adjacent the overhang.

14. A method according to claim 13, wherein the first restriction
endonuclease is BamHI, AvrII, NheI, SpeI, XbaI, KpnI, SphI, AatII, AgeI, XmaI,
NgoMI, BspEI, MluI, SacII, BsiWI, PstI, ApaLI, or isoschizomers thereof.

15. A method according to claim 13, wherein said generating 
nucleic acid sequence information from the representations comprising: 

using sequencing primers to obtain sequence information from

the ends of a subpopulation of genomic segments out of a larger set of genomic 
segments. 
16. A method according to claim 15, wherein the sequencing
primers have a 5' sequence that is complementary to the adapter primers and have a 3'
sequence that is complementary to two or more bases adjacent to a restriction site
recognition sequence to obtain sequencing information adjacent to the restriction site.

17. A method according to claim 2, wherein said selecting a 
subpopulation of genomic segments comprises: 

subjecting an individual clone to a first restriction endonuclease
under conditions effective to cleave DNA from the individual clone so that a first non-
palindromic overhang is created in the clone;
subjecting an individual clone to one or more second restriction
endonuclease under conditions effective to cleave DNA from the individual clone so
that a second overhang different from the first overhang is created in the clone;
adding complementary linker adapters to the first and second
overhangs in the presence of ligase, the first restriction endonuclease, and the one or
more second restriction endonuclease to amplify particular fragments from the
restriction endonuclease digest as a representation, whereby sufficient linker-genomic
fragment products are formed to allow determination of DNA sequences adjacent to the
overhangs.

18. A method according to claim 17, wherein the first restriction
endonuclease creates 2 base degenerate overhangs in the clone and 1 to 12 non-
palindromic linker adapters, which contain single stranded overhangs of the formula
NN/N'N' where NN/N'N' is selected from the group consisting of AA/TT, AC/GT,
AG/CT, CA/TG, GA/TC, and GG/CC, are used.

19. A method according to claim 18, wherein 4 to 6 non-
palindromic adapters are used.

20. A method according to claim 17, wherein the first restriction
endonuclease creates 3 base degenerate overhangs in the clone and 1 to 16 non-
palindromic complementary linker adapters, which contain single stranded overhangs
of the formulae NAA, NAC, NAG, NAT, NCA, NCC, NCG, NCT, NGA, NGC,
NGG, NGT, NTA, NTC, NTG, and NTT, with N being any nucleotide, are used.

21. A method according to claim 20, wherein 5 to 9 non-
palindromic linker adapters are used.

22. A method according to claim 17, wherein the first restriction
endonuclease is selected from the group consisting of DrdI, BglI, DraIII, AlwNI,
PflMI, SanDI, SexAI, PpuI, AvaII, EcoO109, Bsu36I, BsrDI, BsgI, BpmI, SapI, and an
isoschizomer thereof and the one or more second restriction endonuclease is MaeII,
MspI, BfaI, HhaI, HinP1I, Csp6I, TaqI, MseI, or an isoschizomer thereof.

23. A method according to claim 17, wherein said generating
nucleic acid sequence information from the representations comprising:

using sequencing primers to obtain sequence information from 
the ends of a subpopulation of genomic segments out of a larger set of genomic 
segments. 

24. A method according to claim 23, wherein (1) the sequencing
primers have a 5' sequence that is complementary to the adapter primers of the first
restriction site and have a 3' sequence that is complementary to two or more bases in
the degenerate overhang and/or adjacent to a first restriction site recognition sequence
to obtain sequencing information adjacent to the first restriction site and/or (2) the
sequencing primers have a 5' sequence that is complementary to the adapter primers
of one or more second restriction site and have a 3' sequence that is complementary to
two or more bases adjacent to the one or more second restriction site recognition
sequence to obtain sequencing information adjacent to the one or more second
restriction site.


25. A method according to claim 17, wherein 1 to 12 sequencing
primers are used to obtain sequence information adjacent to the first restriction
endonuclease site with the sequencing primers having a 3' end from the set which end
in NN, with N being any nucleotide, and/or its complement N'N'.

26. A method according to claim 17, wherein 1 to 16 sequencing
primers are used to obtain sequence information adjacent to the first restriction
endonuclease site with the sequencing primers having a 3' end from the set which ends
in NAA, NAC, NAG, NAT, NCA, NCC, NCG, NCT, NGA, NGC, NGG, NGT, NTA,
NTC, NTG, and NTT, with N being any nucleotide.

27. A method according to claim 1, wherein said generating nucleic
acid sequence information from the representations comprising:

using sequencing primers to obtain sequence information from 
the ends of a subpopulation of genomic segments out of a larger set of genomic 
segments. 


28. A method according to claim 27, wherein 1 to 16 sequencing
primers are used to obtain sequence information adjacent to the first restriction
endonuclease site with the sequencing primers having a 3' end from the set which
ends in NAA, NAC, NAG, NAT, NCA, NCC, NCG, NCT, NGA, NGC, NGG, NGT,
NTA, NTC, NTG, and NTT, with N being any nucleotide.

29. A method according to claim 27, wherein unique sequencing
data is generated for a unique target known as a singlet sequencing run.

30. A method according to claim 27, wherein two overlapping
sequences are generated for two targets known as a doublet sequencing run.

31. A method according to claim 27, wherein three overlapping
sequences are generated for three targets known as a triplet sequencing run.

32. A method according to claim 27, wherein the sequencing
primer has one or two additional bases on its 3' end to obtain unique singlet sequence
information from two or more overlapping sequences.

33. A method according to claim 1, wherein said analyzing
sequence information comprises:
analyzing sequencing data generated from representations to
deconvolute singlet, doublet and triplet sequencing runs and to determine clone
overlap.


34. A method according to claim 33, wherein two singlet 
sequencing runs in the same representation set from separate genomic clones are 
compared, said method further comprising: 

evaluating the two singlet sequencing runs for clone overlap by
aligning the sequencing runs and scoring identity in at least 8 bases beyond the
endonuclease recognition site with less than 3 discordant positions.

35. A method according to claim 33, wherein a singlet and a
doublet sequencing run in the same representation set from separate genomic clones are
compared, said method further comprising:
evaluating the singlet and doublet sequencing runs for clone
overlap by aligning the sequencing runs and either scoring identity in at least 8 bases
beyond the endonuclease recognition site which are identical in the doublet sequence
with the singlet sequence or, alternatively, by scoring at least 16 cases beyond the
endonuclease recognition site where the singlet sequence is consistent with either of the
bases in the doublet sequence at that position, with less than 3 discordant positions.

36. A method according to claim 33, wherein a singlet and a triplet
sequencing run in the same representation set from separate genomic clones are
compared, said method further comprising:
evaluating whether the clones overlap by aligning the
sequencing runs, considering only those positions in the triplet run where two or less
bases are read, and either scoring identity in at least 8 bases beyond the endonuclease
recognition site which are identical in the triplet sequence with the singlet sequence or,
alternatively, by scoring at least 16 cases beyond the endonuclease recognition site
where the singlet sequence is consistent with either of the bases in the triplet sequence
at that position, with less than 3 discordant positions.

37. A method according to claim 33, wherein a doublet and a
doublet sequencing run in the same representation set from separate genomic clones are
compared, said method further comprising:
evaluating whether the clones overlap by aligning the
sequencing runs and scoring identity in at least 16 cases beyond the endonuclease
recognition site which are either cases where either doublet sequence has an identical
base which is consistent with one or the other of the two bases represented in the other
doublet sequence, or cases where both doublet sequences have the same two bases at
that position, with less than 3 discordant positions.

38. A method according to claim 33, wherein a doublet and a triplet
sequencing run in the same representation set from separate genomic clones are
compared, said method further comprising:
evaluating whether the clones overlap by aligning the
sequencing runs, considering only those positions where two or less bases are read, and
scoring identity in at least 16 cases beyond the endonuclease recognition site which are
either cases where either doublet or triplet sequence has an identical base which is
consistent with one or the other of the two bases represented in the other sequence, or
cases where the doublet and triplet sequences have the same two bases at that position,
with less than 3 discordant positions.

39. A method according to claim 33, wherein two sequencing runs
from separate genomic clones in the same representation are compared with either run
being a singlet, doublet, or triplet, said method further comprising:

evaluating whether the clones are likely not to overlap by 
aligning the sequencing runs and scoring discordance in at least 3 positions. 
40. A method according to claim 1, wherein said combining clone
overlap and sequence information comprises:
comparing sequence information in a second representation,
present on clones which mark ends of contiguous portions of a first representation, to
extend and overlap contigs between representations.

41. A method according to claim 1, wherein said combining clone
overlap and sequence information comprises:
generating sequence information using a different restriction
endonuclease representation on clones which mark ends of contiguous portions in a
first representation.

42. A method according to claim 1, wherein said combining clone
overlap and sequence information comprises:

using singlet sequences in the representations and end 
sequences for each clone to provide additional sequence information for aligning 
contiguous portions with the known databases for that organism. 

43. A method according to claim 1, wherein said combining clone
overlap and sequence information comprises:
obtaining unique singlet sequence information from
overlapping doublet and triplet sequences, to provide additional sequence information
for aligning contiguous portions with the known databases for that organism.


44. A method of identifying single nucleotide polymorphisms in 
genomic DNA comprising: 

creating representations of the genomes of multiple individuals;
creating a representational library from the representation;
generating nucleic acid sequence information from individual
clones of the representational library; and
analyzing the sequence information to identify single
nucleotide polymorphisms among the multiple individuals.

45. A method according to claim 44, wherein said creating
representations of the genomes of multiple individuals comprises:
subjecting the genomes of multiple individuals to a first
restriction endonuclease under conditions effective to cleave DNA so that a first non-
palindromic overhang is created in the genomes of multiple individuals;
subjecting the genomes of multiple individuals to one or more
second restriction endonuclease under conditions effective to cleave DNA so that a
second overhang is created in the genomes of multiple individuals;
adding complementary linker adapters to the first and second
overhangs in the presence of ligase, the first restriction endonuclease, and the one or
more second restriction endonuclease; and
adding PCR primers to amplify fragments from the restriction
endonuclease digest as a representation.

46. A method according to claim 45, wherein the first restriction
endonuclease creates 2 base degenerate overhangs in the genomes of multiple
individuals and 1 to 12 non-palindromic linker adapters, which contain single
stranded overhangs of the formula NN/N'N' where NN/N'N' is selected from the
group consisting of AA/TT, AC/GT, AG/CT, CA/TG, GA/TC, and GG/CC, are used.
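
The six NN/N'N' pairs recited above can be checked by enumeration. The following is a minimal illustrative sketch, not part of the claimed method: it lists all 16 two-base overhangs, discards the four that equal their own reverse complement (AT, TA, CG, GC), and groups the remaining 12 into complementary pairs. The function names are hypothetical and chosen only for this illustration.

```python
from itertools import product

# Watson-Crick complements for the four DNA bases.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def non_palindromic_overhang_pairs() -> list[tuple[str, str]]:
    """Group all non-palindromic 2-base overhangs with their reverse complements."""
    pairs = set()
    for bases in product("ACGT", repeat=2):
        overhang = "".join(bases)
        partner = reverse_complement(overhang)
        if overhang == partner:
            continue  # palindromic overhang (AT, TA, CG, GC): excluded
        # Store each pair once, in alphabetical order.
        pairs.add(tuple(sorted((overhang, partner))))
    return sorted(pairs)


PAIRS = non_palindromic_overhang_pairs()
for first, second in PAIRS:
    print(f"{first}/{second}")
```

The enumeration yields six pairs covering 12 distinct overhangs, which agrees with the "1 to 12" linker adapters and the six NN/N'N' groups recited in the claim (the pair listed in the claim as GG/CC is printed here as CC/GG).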

47. A method according to claim 45, wherein the first restriction
endonuclease creates 3 base degenerate overhangs in the genomes of multiple
individuals and 1 to 16 non-palindromic complementary linker adapters, which
contain single stranded overhangs of the formula NAA, NAC, NAG, NAT, NCA,
NCC, NCG, NCT, NGA, NGC, NGG, NGT, NTA, NTC, NTG, and NTT, with N
being any nucleotide, are used.


48. A method according to claim 45, wherein the first restriction
endonuclease is selected from the group consisting of DrdI, BglI, DraIII, AlwNI,
PflMI, SanDI, SexAI, PpuI, AvaII, EcoO109I, Bsu36I, BsrDI, BsgI, BpmI, SapI, and an
isoschizomer thereof and the one or more second restriction endonuclease is MaeII,
MspI, BfaI, HhaI, HinP1I, Csp6I, TaqI, MseI, or an isoschizomer thereof.

49. A method according to claim 44, wherein said creating
representations of the genomes of multiple individuals comprises:

subjecting the genomes of multiple individuals to a first
restriction endonuclease under conditions effective to cleave DNA so that a
palindromic overhang is created in the genomes of multiple individuals;

adding complementary linker adapters to the overhangs in the
presence of ligase and the first restriction endonuclease; and

adding PCR primers to amplify fragments from the restriction
endonuclease digest as a representation.

50. A method according to claim 49, wherein the first restriction
endonuclease is BamHI, AvrII, NheI, SpeI, XbaI, KpnI, SphI, AatII, AgeI, XmaI,
NgoMI, BspEI, MluI, SacII, BsiWI, PstI, ApaLI, or isoschizomers thereof.

51. A method according to claim 45, wherein PCR primers amplify
fragments from the restriction endonuclease digest as a representation and a single
linker-adapter primer is used to select fragments containing only one of the
degenerate overhangs and the representation of the genome contains approximately
35,500 fragments.

52. A method according to claim 51, wherein a size selection of
approximately 200 to 1,000 bp is applied prior to amplification, and the representation
of the genome contains approximately 19,700 fragments.

53. A method according to claim 51, wherein a size selection of
approximately 200 to 2,000 bp is applied prior to amplification, and the representation
of the genome contains approximately 25,000 fragments.

54. A method according to claim 45, wherein PCR primers amplify
fragments from the restriction endonuclease digest as a representation and more than
one linker-adapter primer is used to select fragments containing some of the
degenerate overhangs, a size selection of approximately 200 to 1,000 bp is applied
prior to amplification, and the representation of the genome contains approximately
40,000 fragments.

55. A method according to claim 45, wherein PCR primers amplify
fragments from the restriction endonuclease digest as a representation and more than
one linker-adapter primer is used to select fragments containing some of the
degenerate overhangs, a size selection of approximately 200 to 1,000 bp is applied
prior to amplification, and the representation of the genome contains approximately
120,000 fragments.

56. A method according to claim 45, wherein PCR primers amplify
fragments from the restriction endonuclease digest as a representation and a single
linker-adapter primer is used to select fragments containing only one of the
degenerate overhangs, a size selection of approximately 200 to 1,000 bp is applied
prior to amplification, a PCR primer with one or two selective bases on the 3' end is
used during the PCR amplification step, and the representation of the genome
contains approximately 5,000 fragments.

57. A method according to claim 44, wherein a representational
library is created from the representation and the linker-adapters used to generate the
representation are methylated and PCR primers used to amplify the representation are
unmethylated, such that the PCR amplified fragments may be cleaved in both primers
to allow for directional cloning of fragments into a cloning vector.






58. A method for large scale detection of single nucleotide 
polymorphisms on a DNA array comprising: 

creating a representation of the genome from a clinical sample; 






providing a plurality of oligonucleotide probe sets, each set 
characterized by (a) a first oligonucleotide probe, having a target-specific portion and 
an addressable array-specific portion, and (b) a second oligonucleotide probe, having 
a target-specific portion and a detectable reporter label, wherein the oligonucleotide 
probes in a particular set are suitable for ligation together when hybridized adjacent to
one another on a corresponding target nucleotide sequence, but have a mismatch 
which interferes with such ligation when hybridized to any other nucleotide sequence 
present in the representation of the sample; 

providing a ligase;

blending the sample, the plurality of oligonucleotide probe sets,
and the ligase to form a mixture;

subjecting the mixture to one or more ligase detection reaction 
cycles comprising a denaturation treatment, wherein any hybridized oligonucleotides 
are separated from the target nucleotide sequences, and a hybridization treatment; 

wherein the oligonucleotide probe sets hybridize at adjacent positions in a
base-specific manner to their respective target nucleotide sequences, if present in the
sample, and ligate to one another to form a ligated product sequence containing 
(a) the addressable array-specific portion, (b) the target-specific portions connected 
together, and (c) the detectable reporter label, and, wherein the oligonucleotide probe 

sets may hybridize to nucleotide sequences in the sample other than their respective
target nucleotide sequences but do not ligate together due to a presence of one or 
more mismatches and individually separate during the denaturation treatment; 

providing a support with different capture oligonucleotides 
immobilized at particular sites, wherein the capture oligonucleotides have nucleotide 

sequences complementary to the addressable array-specific portions;

contacting the mixture, after said subjecting, with the support
under conditions effective to hybridize the addressable array-specific portions to the
capture oligonucleotides in a base-specific manner, thereby capturing the addressable
array-specific portions on the support at the site with the complementary capture
oligonucleotide; and

detecting the reporter labels of ligated product sequences
captured on the support at particular sites, thereby indicating the presence of single
nucleotide polymorphisms.

59. A method according to claim 58, wherein the oligonucleotide
probes in a set are suitable for ligation together at a ligation junction when hybridized
adjacent to one another on a corresponding target nucleotide sequence due to perfect
complementarity at the ligation junction, but, when the oligonucleotide probes in the
set are hybridized to any other nucleotide sequence present in the sample, have a
mismatch at a base at the ligation junction which interferes with such ligation.

60. A method according to claim 59, wherein the mismatch is at the 
3' base at the ligation junction. 

61. A method according to claim 58, wherein said creating a
representation of the genome from a clinical sample comprises:

subjecting the clinical sample to a first restriction endonuclease
under conditions effective to cleave DNA so that a first non-palindromic overhang is
created in the clinical sample;

subjecting the clinical sample to one or more second
restriction endonucleases under conditions effective to cleave DNA so that a second
overhang is created in the clinical sample;

adding complementary linker adapters to the first and second
overhangs in the presence of ligase, the first restriction endonuclease, and the one or
more second restriction endonuclease; and

adding PCR primers to amplify fragments from the restriction
endonuclease digest as a representation.






62. A method according to claim 61, wherein the first restriction
endonuclease creates 2 base degenerate overhangs in the clinical sample and 1 to 12
non-palindromic linker adapters, which contain single stranded overhangs of the
formula NN/N'N' where NN/N'N' is selected from the group consisting of AA/TT,
AC/GT, AG/CT, CA/TG, GA/TC, and GG/CC, are used.

63. A method according to claim 61, wherein the first restriction
endonuclease creates 3 base degenerate overhangs in the clinical sample and 1 to 16
non-palindromic complementary linker adapters, which contain single stranded
overhangs of the formula NAA, NAC, NAG, NAT, NCA, NCC, NCG, NCT, NGA,
NGC, NGG, NGT, NTA, NTC, NTG, and NTT, with N being any nucleotide, are
used.


64. A method according to claim 61, wherein the first restriction
endonuclease is selected from the group consisting of DrdI, BglI, DraIII, AlwNI,
PflMI, SanDI, SexAI, PpuI, AvaII, EcoO109I, Bsu36I, BsrDI, BsgI, BpmI, SapI, and an
isoschizomer thereof and the one or more second restriction endonuclease is MaeII,
MspI, BfaI, HhaI, HinP1I, Csp6I, TaqI, MseI, or an isoschizomer thereof.

65. A method according to claim 58, wherein said creating a
representation of the genome from a clinical sample comprises:

subjecting the clinical sample to a first restriction endonuclease
under conditions effective to cleave DNA so that a palindromic overhang is created in
the clinical sample;

adding complementary linker adapters to the overhangs in the
presence of ligase and the first restriction endonuclease; and

adding PCR primers to amplify fragments from the restriction
endonuclease digest as a representation.

66. A method according to claim 58, wherein the first restriction
endonuclease is BamHI, AvrII, NheI, SpeI, XbaI, KpnI, SphI, AatII, AgeI, XmaI,
NgoMI, BspEI, MluI, SacII, BsiWI, PstI, ApaLI, or isoschizomers thereof.


67. A method according to claim 61, wherein PCR primers amplify
fragments from the restriction endonuclease digest as a representation and a size
selection of approximately 200 to 2,000 bp is applied prior to amplification,
improving the yield of fragments in the representation.

68. A method according to claim 61, wherein a single
linker-adapter primer is used to select fragments containing only one of the degenerate
overhangs and a PCR primer complementary to this linker adapter with one additional
selective base on the 3' end is used during the PCR amplification step.

69. A method according to claim 61, wherein more than one
linker-adapter primer is used to select fragments containing some of the degenerate
overhangs and PCR primers complementary to the more than one linker adapter with
one additional selective base on the 3' end are used.

70. A method according to claim 61, wherein PCR primers amplify
fragments from the restriction endonuclease digest as a representation and a single
linker-adapter primer is used to select fragments containing only one of the
degenerate overhangs and PCR primers complementary to this linker adapter with one
additional selective base on the 3' end are used.

71. A method according to claim 61, wherein PCR primers amplify
fragments from the restriction endonuclease digest as a representation and a single
linker-adapter primer is used to select fragments containing only one of the
degenerate overhangs and PCR primers complementary to this linker adapter with two
additional selective bases on the 3' end are used during the PCR amplification step.


72. A method according to claim 61, wherein PCR primers amplify
fragments from the restriction endonuclease digest as a representation and a single
linker-adapter primer is used to select fragments containing only one of the
degenerate overhangs and a PCR primer complementary to this linker adapter with
two additional selective bases on the 3' end is used during the PCR amplification step.

73. A method according to claim 58, wherein said plurality of
oligonucleotide probe sets comprises:

(a) a first oligonucleotide probe, having a target-specific
portion complementary to a first allele and a first addressable array-specific portion,
(b) a second oligonucleotide probe, having a target-specific portion complementary to
a second allele and a second addressable array-specific portion, and (c) a third
oligonucleotide probe, having a target-specific portion and a detectable reporter label,
wherein the first and third oligonucleotide probes of the set are suitable for ligation
together when hybridized adjacent to one another on a corresponding first allele target
nucleotide sequence, wherein the second and third oligonucleotide probes of the set are
suitable for ligation together when hybridized adjacent to one another on a
corresponding second allele target nucleotide sequence, but each set has a mismatch
which interferes with such ligation when hybridized to any other nucleotide sequence
present in the representation of the sample, and wherein the reporter labels of ligation
product sequences captured on the support at particular sites are detected during said
detecting, where the presence of reporter label at the complement of the first addressable
array-specific portion indicates the presence of the first allele, while presence of reporter
label at the complement of the second addressable array-specific portion indicates the
presence of the second allele, for each set, thereby indicating allele differences.


74. A method according to claim 73, wherein the oligonucleotide 
probes in a set are suitable for ligation together at a ligation junction when hybridized 
adjacent to one another on a corresponding target nucleotide sequence due to perfect 
complementarity at the ligation junction, but, when the oligonucleotide probes in the 

set are hybridized to any other nucleotide sequence present in the sample, have a
mismatch at a base at the ligation junction which interferes with such ligation. 

75. A method according to claim 73, wherein the mismatch is at the 
3' base at the ligation junction.


76. A method according to claim 73, wherein the first and second
alleles differ by a single nucleotide.

77. A method according to claim 73, wherein said method is used
to quantify an allele imbalance between first and second alleles and the different 
capture oligonucleotides immobilized at particular sites are substantially the same for 

both the first allele target nucleotide sequence and the second allele target nucleotide
sequence, wherein the oligonucleotide probe sets have either of two reporter labels 
which can be detected and distinguished independently so that ligation product 
sequences for the first allele target nucleotide sequence and the second allele target
nucleotide sequence are captured on the support with the ratio of the first reporter 

label to the second reporter label at the complement of the first addressable
array-specific portion divided by the ratio of the first reporter label to the second reporter
label at the complement of the second addressable array-specific portion reflecting an 
initial allele ratio for each test and normal allele position and the relative imbalance of 
the first and second alleles in a test sample is determined by dividing the initial allele 

ratio for the test sample by the initial allele ratio for a normal sample, whereby (1) a
ratio of > 1 indicates that the first allele is in that number-fold greater in quantity than 
the second allele, (2) a ratio of < 1 indicates that the second allele is in the inverse 
number-fold greater in quantity than the first allele, and (3) a ratio of about 1 indicates 
the first and second allele are present in about the same quantity. 
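
The ratio arithmetic recited in claim 77 can be illustrated with a short sketch. This is only one reading of the claim language, under stated assumptions: the signal values are invented for illustration, the class and function names are hypothetical, and the tolerance used to decide what counts as "about 1" is an assumption, since the claim does not quantify it.

```python
from dataclasses import dataclass


@dataclass
class AddressSignal:
    """Reporter label signals measured at one array address (illustrative units)."""
    label1: float
    label2: float

    def ratio(self) -> float:
        # Ratio of the first reporter label to the second reporter label.
        return self.label1 / self.label2


def initial_allele_ratio(first_address: AddressSignal,
                         second_address: AddressSignal) -> float:
    """Label ratio at the first address divided by the label ratio at the second."""
    return first_address.ratio() / second_address.ratio()


def relative_imbalance(test_ratio: float, normal_ratio: float) -> float:
    """Initial allele ratio of the test sample divided by that of the normal sample."""
    return test_ratio / normal_ratio


def interpret(imbalance: float, tolerance: float = 0.2) -> str:
    """Map the imbalance to the three outcomes; `tolerance` is an assumed cutoff."""
    if abs(imbalance - 1.0) <= tolerance:
        return "first and second alleles present in about the same quantity"
    if imbalance > 1.0:
        return f"first allele {imbalance:.2f}-fold greater than second allele"
    return f"second allele {1.0 / imbalance:.2f}-fold greater than first allele"


# Hypothetical measurements, chosen only to make the arithmetic easy to follow.
test = initial_allele_ratio(AddressSignal(800.0, 200.0), AddressSignal(400.0, 400.0))
normal = initial_allele_ratio(AddressSignal(500.0, 250.0), AddressSignal(500.0, 250.0))
imbalance = relative_imbalance(test, normal)
print(imbalance, "->", interpret(imbalance))
```

With these invented values the test sample has an initial allele ratio of 4 and the normal sample a ratio of 1, so the relative imbalance is 4, which falls under case (1).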


78. A method according to claim 77, wherein said method is for 
quantifying loss of heterozygosity (LOH) or gene amplification in a tumor sample 
containing up to 50% stromal contamination by comparing allele imbalance at a 
tumor gene locus with allele balance at a control gene locus among a tumor and 

normal sample from the same individual and the different capture oligonucleotides

immobilized at particular sites are substantially the same for both the first allele target 
nucleotide sequence and the second allele target nucleotide sequence, the two alleles 
being heterozygous at both the tumor gene locus and the control gene locus with the 
ratio of the first reporter label to the second reporter label at the complement of the 

first addressable array-specific portion for the tumor gene locus divided by the ratio of
the first reporter label to the second reporter label at the complement of the first 
addressable array-specific portion for the control gene locus reflecting an initial tumor
to control first allele ratio, wherein for both test and normal samples where the ratio of
the first reporter label to the second reporter label at the complement of the second 
addressable array-specific portion for the tumor gene locus divided by the ratio of the 
first reporter label to the second reporter label at the complement of the second 
addressable array-specific portion for the control gene locus reflects an initial tumor
to control second allele ratio and a presence of gene amplification or LOH of the first 
and second tumor alleles in the tumor sample is determined by dividing the initial 
tumor to control level for a given allele ratio for the tumor sample by the initial tumor 
to control level for a given allele ratio for the normal sample where (1) a ratio of > 2 

for a first tumor gene allele indicates the first tumor gene allele is amplified in the

tumor sample, compared with the normal sample, (2) a ratio of > 2 for a second tumor 
gene allele indicates the second tumor gene allele is amplified in the tumor sample, 
compared with the normal sample, (3) a ratio of < 0.5 for a first tumor gene allele 
shows that the first tumor gene allele underwent LOH in the tumor sample, compared 

with the normal sample, (4) a ratio of < 0.5 for a second tumor gene allele shows that
the second tumor gene allele underwent LOH in the tumor sample, compared with the
normal sample, and (5) a ratio of about 1 indicates a given tumor allele did not 
undergo LOH or amplification, compared with the normal sample. 
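
The decision thresholds recited in claim 78 can be sketched as a small classifier. This is a hedged illustration rather than the claimed procedure itself: the function names are hypothetical, the tolerance for "about 1" is an assumption, and ratios that fall between the recited thresholds without being close to 1 are reported here as unclassified because the claim does not assign them an outcome.

```python
def tumor_to_control_ratio(tumor_locus_label_ratio: float,
                           control_locus_label_ratio: float) -> float:
    """Label ratio at the tumor gene locus divided by that at the control gene locus."""
    return tumor_locus_label_ratio / control_locus_label_ratio


def classify_allele(tumor_sample_ratio: float,
                    normal_sample_ratio: float,
                    about_one_tolerance: float = 0.25) -> str:
    """Classify one allele from tumor-to-control ratios of tumor and normal samples.

    `about_one_tolerance` is an assumed cutoff for the recited "ratio of about 1".
    """
    ratio = tumor_sample_ratio / normal_sample_ratio
    if ratio > 2.0:
        return "amplified"
    if ratio < 0.5:
        return "LOH"
    if abs(ratio - 1.0) <= about_one_tolerance:
        return "no LOH or amplification"
    return "unclassified"


# Hypothetical ratios for illustration only.
print(classify_allele(3.0, 1.0))  # ratio 3.0 exceeds 2
print(classify_allele(0.4, 1.0))  # ratio 0.4 is below 0.5
print(classify_allele(1.1, 1.0))  # ratio 1.1 is close to 1
```

Dividing by the normal sample's tumor-to-control ratio normalizes out locus-specific and label-specific differences in signal, so that the thresholds of 2 and 0.5 apply to the change in the tumor sample relative to the same individual's normal sample.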

79. A method according to claim 73, wherein the
method is utilized for quantifying an allele imbalance between a test sample and a
normal sample with each set characterized by both first and second oligonucleotide
probes, a percentage of each having a second distinct detectable reporter label, wherein
the two reporter labels may be detected and distinguished independently such that the 

ratio of the first reporter label to the second reporter label at the complement of the
first addressable array-specific portion divided by the ratio of the first reporter label to 
the second reporter label at the complement of the second addressable array-specific 
portion reflects an initial allele ratio for each test and normal allele position and a 
relative imbalance of the first and second alleles in the test sample is determined by 

dividing the initial allele ratio for the test sample by the initial allele ratio for the
normal sample, wherein (1) a ratio of > 1 indicates that the first allele is in that 
number-fold greater in quantity than the second allele, (2) a ratio of < 1 indicates that
the second allele is in the inverse number-fold greater in quantity than the first allele,
and (3) a ratio of about 1 indicates that the first and second allele are present in about 
the same quantity, indicating there is no allele imbalance compared with the normal 
sample. 


80. A method according to claim 79, wherein said method is 
carried out for quantifying loss of heterozygosity (LOH) or gene amplification in a 
tumor sample containing up to 50% stromal contamination by comparing allele 
imbalance at a tumor gene locus with allele balance at a control gene locus among a 

tumor and normal sample from the same individual with the two alleles being

heterozygous at both the tumor gene locus and the control gene locus and the ratio of 
the first reporter label to the second reporter label at the complement of the first 
addressable array-specific portion for the tumor gene locus divided by the ratio of the 
first reporter label to the second reporter label at the complement of the first 

addressable array-specific portion for the control gene locus reflecting an initial tumor
to control first allele ratio, such that for both test and normal samples, the ratio of the 
first reporter label to the second reporter label at the complement of the second 
addressable array-specific portion for the tumor gene locus divided by the ratio of the 
first reporter label to the second reporter label at the complement of the second 

addressable array-specific portion for the control gene locus reflects an initial tumor
to control second allele ratio and the presence of gene amplification or LOH of the 
first and second tumor alleles in the tumor sample is determined by dividing the initial 
tumor to control for a given allele ratio for the tumor sample by the initial tumor to 
control for a given allele ratio for the normal sample, wherein (1) a ratio of > 2 for a 

first tumor gene allele indicates the first tumor gene allele is amplified in the tumor
sample, compared with the normal sample, (2) a ratio of > 2 for a second tumor gene 
allele indicates the second tumor gene allele is amplified in the tumor sample, 
compared with the normal sample, (3) a ratio of < 0.5 for a first tumor gene allele
indicates the first tumor gene allele underwent LOH in the tumor sample, compared 

with the normal sample, (4) a ratio of < 0.5 for a second tumor gene allele indicates
the second tumor gene allele underwent LOH in the tumor sample, compared with the
normal sample, and (5) a ratio of about 1 indicates a given tumor allele did not
undergo LOH or amplification, compared with the normal sample. 

81. A method according to claim 58, wherein said providing a 
plurality of oligonucleotide probe sets with each set characterized by (a) a first

oligonucleotide probe, having a target-specific portion complementary to a first allele 
and a first detectable reporter label, (b) a second oligonucleotide probe, having a 
target-specific portion complementary to a second allele and a second distinct 
detectable reporter label and (c) a third oligonucleotide probe, having a target-specific 

portion and an addressable array-specific portion, wherein the first and third

oligonucleotide probes set are suitable for ligation together when hybridized adjacent 
to one another on a corresponding first allele target nucleotide sequence, wherein the 
second and third oligonucleotide probes set are suitable for ligation together when 
hybridized adjacent to one another on a corresponding second allele target nucleotide 

sequence, but each set has a mismatch which interferes with such ligation when
hybridized to any other nucleotide sequence present in the representation of the 
sample with the two reporter labels being detected and distinguished independently 
such that detection of the first reporter label at the complement of the addressable 
array-specific portion indicates a presence of the first allele, while detection of the 

second reporter label at the complement of the addressable array-specific portion
indicates a presence of the second allele, for each set. 

82. A method according to claim 81, wherein the mismatch is at a 
3' base at the ligation junction. 


83. A method according to claim 81, wherein the first and second
alleles differ by a single nucleotide.

84. A method according to claim 81, wherein said method is
used to quantify an allele imbalance between first and second alleles and the different
capture oligonucleotides immobilized at particular sites are substantially the same for
both the first allele target nucleotide sequence and the second allele target nucleotide
sequence, wherein the oligonucleotide probe sets have either of two reporter labels
which can be detected and distinguished independently so that ligation product 
sequences for the first allele target nucleotide sequence and the second allele target 
nucleotide sequence are captured on the support at particular sites with the ratio of the 
first reporter label to the second reporter label at the complement of the first

addressable array-specific portion divided by the ratio of the first reporter label to the 
second reporter label at the complement of the second addressable array-specific 
portion reflecting an initial allele ratio for each test and normal allele position and the 
relative imbalance of the first and second alleles in the test sample is determined by 

dividing the initial allele ratio for the test sample by the initial allele ratio for the
normal sample, whereby (1) a ratio of > 1 indicates that the first allele is in that
number-fold greater quantity over the second allele, (2) a ratio of < 1 indicates that the
second allele is in the inverse number-fold greater quantity over the first allele, and 
(3) a ratio of about 1 determines the first and second allele are present in about the 

same quantity.

85. A method according to claim 81, wherein said method is for
quantifying loss of heterozygosity (LOH) or gene amplification in a tumor sample 
containing up to 50% stromal contamination by comparing allele imbalance at a 

tumor gene locus with allele balance at a control gene locus among a tumor and
normal sample from the same individual and the different capture oligonucleotides 
immobilized at particular sites are substantially the same for both the first allele target 
nucleotide sequence and the second allele target nucleotide sequence, the two alleles 
being heterozygous at both the tumor gene locus and the control gene locus with the 

ratio of the first reporter label to the second reporter label at the complement of the

first addressable array-specific portion for the tumor gene locus divided by the ratio of 
the first reporter label to the second reporter label at the complement of the first 
addressable array-specific portion for the control gene locus reflecting an initial tumor 
to control first allele ratio, wherein for both test and normal sample where the ratio of 

the first reporter label to the second reporter label at the complement of the second
addressable array-specific portion for the tumor gene locus divided by the ratio of the 
first reporter label to the second reporter label at the complement of the second 




addressable array-specific portion for the control gene locus reflects an initial tumor 
to control second allele ratio and a presence of gene amplification or LOH of the first 
and second tumor alleles in the tumor sample is determined by dividing the initial 
tumor to control level for a given allele ratio for the tumor sample by the initial tumor
to control level for a given allele ratio for the normal sample where (1) a ratio of > 2
for a first tumor gene allele indicates the first tumor gene allele is amplified in the 
tumor sample, compared with the normal sample, (2) a ratio of > 2 for a second tumor 
gene allele indicates the second tumor gene allele is amplified in the tumor sample, 
compared with the normal sample, (3) a ratio of < 0.5 for a first tumor gene allele
determines the first tumor gene allele underwent LOH in the tumor sample, compared
with the normal sample, (4) a ratio of < 0.5 for a second tumor gene allele determines 
the second tumor gene allele underwent LOH in the tumor sample, compared with the
normal sample, (5) a ratio of about 1 determines a given tumor allele did not undergo 
LOH or amplification, compared with the normal sample. 


86. A method according to claim 81, wherein the
method is utilized for quantifying an allele imbalance between a test sample and a
normal sample with each set characterized by both first and second oligonucleotide
probes, a percentage of each having a second distinct detectable reporter label, wherein

the two reporter labels may be detected and distinguished independently such that the
ratio of the first reporter label to the second reporter label at the complement of the 
first addressable array-specific portion divided by the ratio of the first reporter label to 
the second reporter label at the complement of the second addressable array-specific 
portion reflects an initial allele ratio for each test and normal allele position and the 

relative imbalance of the first and second alleles in the test sample is determined by
dividing the initial allele ratio for the test sample by the initial allele ratio for the 
normal sample, wherein (1) a ratio of > 1 indicates that the first allele is in that 
number-fold greater quantity over the second allele, (2) a ratio of < 1 indicates that the 
second allele is in the inverse number-fold greater quantity over the first allele, and 

(3) a ratio of about 1 indicates that the first and second allele are present in about the
same quantity, indicating there is no allele imbalance compared with the normal
sample.

87. A method according to claim 81, wherein said method is
carried out for quantifying loss of heterozygosity (LOH) or gene amplification in a 
tumor sample containing up to 50% stromal contamination by comparing allele 

imbalance at a tumor gene locus with allele balance at a control gene locus among a
tumor and normal sample from the same individual with the two alleles being 
heterozygous at both the tumor gene locus and the control gene locus and the ratio of
the first reporter label to the second reporter label at the complement of the first
addressable array-specific portion for the tumor gene locus divided by the ratio of the 

first reporter label to the second reporter label at the complement of the first

addressable array-specific portion for the control gene locus reflecting an initial tumor 
to control first allele ratio, such that for both test and normal sample, the ratio of the 
first reporter label to the second reporter label at the complement of the second 
addressable array-specific portion for the tumor gene locus divided by the ratio of the 

15 first reporter label to the second reporter label at the complement of the second 

addressable array-specific portion for the control gene locus reflects ^^an initial tumor 
to control second allele ratio and the presence of gene amplification or LOH of the 
first and second tumor alleles in the tumor sample is determined by dividing the initial 
tumor to control for a given allele ratio for the tumor sample by the initial tumor to 

20 control for a given allele ratio for the normal sample, wherein (1) a ratio of > 2 for a 
first tumor gene allele indicates the first tumor gene allele is amplified in the tumor 
sample, compared with the normal sample, (2) a ratio of > 2 for a second tumor gene 
allele indicates the second tumor gene allele is amplified in the tumor sample, 
compared with the normal sample, (3) a ratio of < 0.5 for a first tumor gene allele 

25 indicates the first tumor gene allele underwent LOH in the tumor sample, compared 
with the normal sample, (4) a ratio of < 0.5 for a second txmior gene allele indicates 
the second tumor gene allele underwent LOH in the tumor sample, compared with the 
normal sample, and (5) a ratio of about 1 indicates a given tumor allele did not 
undergo LOH or amplification, compared with the normal sample. 
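
The thresholds recited in this claim (> 2 for amplification, < 0.5 for LOH) can be sketched as a small classifier. The function names and the input values are hypothetical, and the label returned for ratios between 0.5 and 2 is an assumption: the claim only states that a ratio of about 1 indicates no change.

```python
def tumor_to_control(tumor_locus_ratio, control_locus_ratio):
    """Reporter-label ratio at the tumor gene locus divided by the same
    ratio at the control gene locus, for one allele in one sample."""
    return tumor_locus_ratio / control_locus_ratio


def classify_allele(tumor_sample_value, normal_sample_value):
    """Compare the tumor-to-control ratio of the tumor sample with that of
    the normal sample, using the > 2 and < 0.5 thresholds of the claim."""
    ratio = tumor_sample_value / normal_sample_value
    if ratio > 2:
        return "amplified"
    if ratio < 0.5:
        return "LOH"
    # Assumed label for the band between the two thresholds.
    return "no LOH or amplification"


# Hypothetical values: the tumor sample shows a 3-fold higher
# tumor-to-control ratio than the matched normal sample.
tumor_value = tumor_to_control(3.0, 1.0)
normal_value = tumor_to_control(1.0, 1.0)
print(classify_allele(tumor_value, normal_value))
```

Each allele of the tumor gene is classified separately, so the function would be called once with the first-allele ratios and once with the second-allele ratios.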


88. A method to sequence directly from a PCR amplified nucleic
acid molecule without primer interference comprising:
amplifying a nucleic acid molecule using PCR primers
containing alternative nucleoside bases under conditions effective to produce PCR
amplification products and

cleaving the PCR primers both incorporated and
unincorporated in the PCR amplification products under conditions which leave the
PCR amplification products intact.



89. A method according to claim 88, wherein the PCR primers
contain dUTP and starting primers and incorporated primers are cleaved using
uracil-N-glycosylase (ung) prior to DNA sequencing.



90. A method according to claim 88, wherein the PCR primers
contain ribonucleosides and starting primers and incorporated primers are cleaved
with a base (0.1 N NaOH) followed by neutralization with a buffer prior to DNA
sequencing.
Sequencing DrdI islands in random plasmid or cosmid clones

1. PCR amplify fragment from random clone of a genomic DNA library. Cut with
DrdI in the presence of linkers and T4 ligase. Linker for DrdI site is
phosphorylated and contains a 3' AA overhang. Biochemical selection assures
that most AA sites contain linkers. (Separate reactions are performed for
linkers containing the other non-palindromic 3' overhangs).

2. Inactivate T4 ligase and restriction endonucleases at 95°C for 5 min. Add
longer sequencing primer which contains a 3' AA end, and perform a
cycle-sequencing reaction. If sequence information is difficult to interpret,
additional selectivity can be achieved by performing four separate sequencing
reactions using sequencing primers containing 3' ends of AAA, AAC, AAG, and
AAT respectively.

[Diagram: DrdI recognition site 5' GACNNNNNNGTC 3' / 3' CTGNNNNNNCAG 5';
linker 5' CTAATAA 3' / 3' GATTA 5'; ligated product 5' CTAATAANNGTC 3' /
3' GATTATTNNCAG 5'; sequencing primer ending in 3' AA annealed to
3' GATTATTNNCAG 5'; or, in 4 independent sequencing reactions, primers ending
in 3' AAA, AAC, AAG, and AAT.]

FIG. 1
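
As an illustration of how DrdI sites and their two-base overhangs might be located in a clone sequence, a minimal sketch follows. It assumes that DrdI cuts GACNNNN^NNGTC so that the central two bases of the six-base spacer form the 3' overhang, which is consistent with the ligated product shown in the figure; the function names and the test sequence are hypothetical.

```python
import re

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

# Lookahead so that overlapping DrdI sites are also reported.
_DRDI = re.compile(r"(?=GAC([ACGT]{6})GTC)")


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def drdi_sites(seq):
    """Return (1-based site start, overhang, complement, is_palindromic) per DrdI site.

    Assumes the 3' overhang is the central dinucleotide of the spacer.
    """
    sites = []
    for match in _DRDI.finditer(seq.upper()):
        overhang = match.group(1)[2:4]
        complement = reverse_complement(overhang)
        sites.append((match.start() + 1, overhang, complement, overhang == complement))
    return sites


# Hypothetical sequence containing one DrdI site with an AA overhang.
print(drdi_sites("TTGACATAAGCGTCAA"))
```

Sites whose overhang equals its own reverse complement (AT, TA, GC, CG) are flagged, since the figures treat palindromic overhangs as unused.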
Scheme 1 for sequencing restriction endonuclease generated representations

Individual clone from Cosmid, PAC, or BAC library

1. PCR amplify or partially purify DNA from clone.

2. Cut with restriction endonuclease(s) in the presence of linkers and T4
ligase. For endonucleases which create degenerate ends, add multiple divergent
linkers with non-palindromic overhangs.

3. PCR amplify if needed to generate sufficient DNA template for
cycle-sequencing.

4. Aliquot into multiple wells and perform individual cycle-sequencing
reactions using primers which are complementary to the particular linker
sequences and/or which contain one or two additional selective bases on the
3' end.

FIG. 2
Scheme 2 for sequencing restriction endonuclease generated representations

Individual clone from Cosmid, PAC, or BAC library

1. PCR amplify or partially purify DNA from clone.

2. Aliquot into multiple wells and cut with restriction endonuclease(s) in the
presence of linkers and T4 ligase. Each well contains linkers with different
non-palindromic overhangs.

3. PCR amplify if needed to generate sufficient DNA template for
cycle-sequencing.

4. Perform individual cycle-sequencing reactions using primers which are
complementary to the particular linker sequences or which contain one or two
additional selective bases on the 3' end.

FIG. 3
DNA sequencing directly from PCR amplified DNA without primer interference

1. PCR amplify using oligonucleotides containing ribose U replacing dT. Add
dNTPs and Taq polymerase.

2. Add 0.1 N NaOH and heat to 95°C for 5 min to destroy unused primers.

3. Neutralize, dilute into two new wells. Anneal forward and reverse primers
in separate reactions to run fluorescent dideoxy-sequencing reactions.

[Diagram: PCR product with ribose U-containing primers at both ends; primer
cleavage after NaOH treatment; forward and reverse sequencing primers annealed
to the cleaved product in separate reactions.]

FIG. 4
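
The effect of the NaOH step on a ribose U-containing primer can be pictured with a short sketch. It is a simplification that assumes the backbone is broken on the 3' side of every ribonucleotide (written here as `U`) and ignores end chemistry; the function name and the example primer are illustrative.

```python
def alkaline_fragments(primer):
    """Split a primer after each ribose U, modelling base-catalysed cleavage
    of the backbone 3' of every ribonucleotide (assumed simplification)."""
    fragments = []
    current = ""
    for base in primer.upper():
        current += base
        if base == "U":
            fragments.append(current)
            current = ""
    if current:
        fragments.append(current)
    return fragments


# Illustrative primer in which every dT has been replaced by ribose U.
pieces = alkaline_fragments("CUAAUAC")
print(pieces, max(len(p) for p in pieces))
```

Under this model the primer is reduced to pieces of at most three bases, which are too short to prime a subsequent sequencing reaction.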
Sequencing DrdI islands in random BAC clones

1. Cut BAC DNA with MspI and DrdI in the presence of linkers and T4 ligase.
Linker for DrdI site is phosphorylated and contains a 3' AA overhang. Linker
for MspI site is not phosphorylated, and contains a bubble. Biochemical
selection assures that most sites contain linkers.

2. Inactivate T4 ligase and restriction endonucleases at 95°C for 5 min. PCR
amplify using primers containing ribose U replacing dT, dNTPs, and Taq
polymerase. Primer specific to the DrdI site linker will extend through bubble
of MspI site linker. This allows the primer specific to the MspI site linker
to amplify the DrdI-MspI fragment. MspI-MspI fragments will not amplify since
they contain bubbles on both ends.

3. Add 0.1 N NaOH and heat to 95°C for 5 min to destroy unused primers.

4. Neutralize and dilute. Anneal sequencing primer to the DrdI site linker and
perform a cycle-sequencing reaction. (A separate reaction may be performed
using a primer annealing to the MspI site linker).

[Diagram: DrdI site 5' GACNNNNNNGTC 3' / 3' CTGNNNNNNCAG 5' and MspI site
5' CCGG 3' / 3' GGCC 5'; DrdI linker 5' CTAATAA 3' / 3' GATTA 5'; ligated
DrdI end 5' CTAATAANNGTC 3' / 3' GATTATTNNCAG 5' with the MspI bubble linker
(CCGT / GGCA) at the other end; sequencing primer ending in 3' AA.]

FIG. 5






Sequencing DrdI islands in random BAC clones

1. Cut BAC DNA with DrdI, MspI and TaqI in the presence of linkers and T4
ligase. Linker for DrdI site is phosphorylated and contains a 3' AA overhang.
Linker for MspI/TaqI site is phosphorylated, 3' blocked and contains a bubble.
Biochemical selection assures that most sites contain linkers.

2. Inactivate T4 ligase and restriction endonucleases at 95°C for 5 min. PCR
amplify using primers containing ribose U replacing dT, dNTPs, and Taq
polymerase. Primer specific to the DrdI site linker will extend through bubble
of MspI site linker. This allows the primer specific to the MspI site linker
to amplify the DrdI-MspI fragment. Other fragments will not amplify since they
contain bubbles on both ends.

3. Add 0.1 N NaOH and heat to 95°C for 5 min to destroy unused primers.

4. Neutralize and dilute. Anneal sequencing primer to the DrdI site linker and
perform a cycle-sequencing reaction. (A separate reaction may be performed
using a primer annealing to the MspI/TaqI site linker).

[Diagram: DrdI site 5' GACNNNNNNGTC 3' / 3' CTGNNNNNNCAG 5' and MspI (or TaqI)
site 5' CCGG 3' / 3' GGCC 5'; DrdI linker 5' CTAATAA 3' / 3' GATTA 5'; ligated
DrdI end 5' CTAATAANNGTC 3' / 3' GATTATTNNCAG 5' with the 3' blocked (Bk)
bubble linker (CCGT / GGCA) at the other end; amplified strand
5' CUAAUAANNGTC 3' with ribose U positions marked for cleavage; sequencing
primer ending in 3' AA.]

FIG. 6
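
The selection described in step 2 can be summarized as a rule on the two ends of each fragment. The sketch below is illustrative only: the end labels and function name are hypothetical, and the treatment of fragments with a DrdI linker at both ends as non-amplifying is an assumption, since the figure only addresses DrdI-MspI fragments and fragments with bubbles on both ends.

```python
# Ends that carry a bubble linker according to the figure.
BUBBLE_ENDS = {"MspI", "TaqI"}


def amplifies(end_a, end_b):
    """True when exactly one end carries the DrdI linker and the other end
    carries a bubble linker (assumed rule; DrdI-DrdI fragments are not modelled)."""
    ends = (end_a, end_b)
    one_drdi_end = ends.count("DrdI") == 1
    one_bubble_end = sum(end in BUBBLE_ENDS for end in ends) == 1
    return one_drdi_end and one_bubble_end


for fragment in [("DrdI", "MspI"), ("TaqI", "DrdI"), ("MspI", "MspI"), ("MspI", "TaqI")]:
    print(fragment, amplifies(*fragment))
```

Only the fragments with one DrdI end are reported as amplifiable, which mirrors the statement that fragments with bubbles on both ends will not amplify.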
Three degrees of specificity in amplifying a DrdI representation.

1. Ligation of the top strand requires perfect complementarity at the 3' side
of the junction (50-fold specificity).

2. Ligation of the bottom strand requires perfect complementarity at the 3'
side of the junction (50-fold specificity).

3. Extension of polymerase off the sequencing primer is most efficient if the
3' base is perfectly matched (10 to 100-fold specificity).

[Diagram: sequencing primer ending in 3' AA annealed at the ligated
linker-DrdI junction, 5' CTAATAANNGTC 3' / 3' GATTATTNNCAG 5'.]

FIG. 7
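
If the three discrimination steps listed above are assumed to act independently, their stated fold-specificities multiply. The figure does not state a combined value, so the sketch below is a back-of-the-envelope calculation under that independence assumption rather than a reported result.

```python
from math import prod

# Fold-specificities as listed in the figure.
TOP_STRAND_LIGATION = 50
BOTTOM_STRAND_LIGATION = 50
PRIMER_EXTENSION_RANGE = (10, 100)

# Assumption: the three steps discriminate independently, so the factors multiply.
combined_low = prod([TOP_STRAND_LIGATION, BOTTOM_STRAND_LIGATION, PRIMER_EXTENSION_RANGE[0]])
combined_high = prod([TOP_STRAND_LIGATION, BOTTOM_STRAND_LIGATION, PRIMER_EXTENSION_RANGE[1]])

print(combined_low, combined_high)
```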
RG253B13, 7q31 Met Oncogene

12 DrdI and 16 BglI Sites in 171,905 bp

[Restriction map: positions of the 16 BglI and 12 DrdI sites along the clone,
scale 25,000 to 150,000 bp.]

DrdI #   Location   Overhang   Complement
1.       5,379      GG         CC
2.       26,865     GT         AC
3.       33,300     GG         CC
4.       45,528     AT         AT
5.       70,522     AT         AT
6.       91,675     TC         GA
7.       96,500     CA         TG
8.       99,622     CT         AG
9.       101,434    AA         TT
10.      113,042    GT         AC
11.      137,171    AA         TT
12.      159,679    CT         AG





Location 


Overhang 


Complement 
ACA^" 


1. 


13,833 


70^" 
ACA'" 




2. 


25,115 






3. 


33.890 


GAA" 




770® 


4. 


51,623 


707*" 




AOA*" 


5. 


58.308 


C7A*'' 




7AO'' 


6. 


88.316 


77A** 




7AA'' 


7. 


94,134 


GOG*' 




COO** 


8. 


99.463 


ACA*" 




707*" 


9. 


1 00.045 ACC* 




GG7*'' 




10. 


106,6 13 CCA'" 




700'* 




11. 


129,192707*'' 




ACA*" 




12. 


137,747707® 




AGA** 




13. 


149,246707*" 




ACA*" 




14. 


156.577777® 




AAA" 




15. 


161,461 COA*" 




7C0® 




16. 


1 65,697 C70® 




CAO* 





                                                                          DrdI    BglI
Unique sites, per 40 kb (singlet).                                        (1.4)   (3.3)
Same last 2 bases of 3' overhang, per 40 kb (doublet).                    (1.0)   (4.3)
Palindromic overhang, not used.                                            2
Same last 2 bases of 3' overhang within BAC used exactly once (singlet).   2       5
Same last 2 bases of 3' overhang within BAC used exactly twice (doublet).  4       5
Same last 2 bases of 3' overhang within BAC used more than twice.          0       3

FIG. 8
RG253B13, 7q31 Met Oncogene
25 SapI Sites in 171,905 bp

[Restriction map: positions of the 25 SapI sites along the clone, scale 25,000
to 150,000 bp.]

SapI #   Location   SapI Overhang   Ligated Complement
1.       1,198      CTA             TAG
2.       1,456      AGG             CCT
3.       10,943     GCT             AGC
4.       10,955     GCT             AGC
5.       11,041     CAA             TTG
6.       31,031     AAT             ATT
7.       32,599     GAT             ATC
8.       37,053     AGA             TCT
9.       38,931     GGG             CCC
10.      39,877     ATC             GAT
11.      44,325     CTT             AAG
12.      56,040     ACA             TGT
13.      68,850     ACC             GGT
14.      76,930     GTG             CAC
15.      100,250    GGG             CCC
16.      112,850    GAT             ATC
17.      135,473    ACA             TGT
18.      135,608    GGA             TCC
19.      136,239    TTG             CAA
20.      142,243    GCC             GGC
21.      148,475    GCG             CGC
22.      157,978    TCT             AGA
23.      160,833    ACC             GGT
24.      166,153    ATT             AAT
25.      171,460    GTT             AAC

                                                                           SapI
Same last 2 bases of 3' overhang within BAC used exactly once (singlet).    5
Same last 2 bases of 3' overhang within BAC used exactly twice (doublet).   10
Same last 2 bases of 3' overhang within BAC used more than twice.           3

FIG. 9
RG363E19, 7q3.1 HMG gene

11 DrdI and 12 BglI Sites in 165,608 bp

[Restriction map: positions of the 12 BglI and 11 DrdI sites along the clone,
scale 20,000 to 160,000 bp.]

DrdI #   Location   Overhang   Complement
1.       30,500     CT         AG
2.       41,442     GG         CC
3.       63,326     AG         CT
4.       64,189     TT         AA
5.       70,300     GT         AC
6.       77,512     CA         TG
7.       78,858     TG         CA
8.       92,723     TG         CA
9.       132,104    GA         TC
10.      137,827    CC         GG
11.      161,478    AT         AT

BglI #   Location   Overhang   Complement
1.       14,666     GAG        CTC
2.       54,284     AGA        TCT
3.       60,389     AGA        TCT
4.       67,808     CCT        AGG
5.       86,331     TGG        CCA
6.       99,283     CTC        GAG
7.       104,281    GTT        AAC
8.       109,938    CGG        CCG
9.       122,096    GGG        CCC
10.      129,631    TGT        ACA
11.      139,404    AAA        TTT
12.      163,611    TCT        AGA

                                                          DrdI    BglI
Unique sites, per 40 kb (singlet).                        (1.2)   (3.9)
Same last 2 bases of 3' overhang, per 40 kb (doublet).    (1.2)   (2.0)
Palindromic overhang, not used.                            1

FIG. 10
RG363E19, 7q3.1 HMG gene
12 SapI Sites in 165,608 bp



20000 
I 



40000 
I 



60000 
I 



80000 100000 120000 140000 16000C 



J. 



Sapl 12 



SapW 


Location 


Sapl 


Ligated 






Overhang 


Complement 


1. 


3,048 


ACA 


TGT® 


2. 


14.192 


CGG 


CCG® 


3. 


45,137 


CTA 


TAG" 


4. 


49,039 


TAG 


GTA* 


5. 


56,731 


ccr 


AGG® 


6. 


62,838 


TAA 


TTA* 


7. 


70,117 


TGG 


CCA® 


8. 


90,393 


AAA 




9. 


104,917 


CTT 


AAG" 


10. 


138,863 


CTG 


CAG" 


11. 


144,649 


AAA 


TTT" 


12. 


146,805 


AAA 





Same last 2 bases of 3' overhang within BAC used exactly once (singlet). 4 
*'Same last 2 bases of 3' overhang within BAC used exactly twice (doublet). 1 
^Same last 2 bases of 3' overhang within BAC used more than twice. 2 



FIG. 11 
RG364P16, 7q31 Pendrin gene

10 DrdI and 17 BglI Sites in 97,943 bp

[Restriction map: positions of the 17 BglI and 10 DrdI sites along the clone,
scale 10,000 to 90,000 bp.]

DrdI #   Location   Overhang   Complement
1.       620        CC         GG
2.       1,478      CT         AG
3.       1,697      CT         AG
4.       18,677     AG         CT
5.       19,514     TT         AA
6.       25,223     GA         TC
7.       39,952     AA         TT
8.       65,731     AA         TT
9.       66,419     TA         TA
10.      67,412     GC         GC

BglI #   Location   Overhang   Complement
1.       2,239      GGC        GCC
2.       2,779      AAC        GTT
3.       3,237      GGC        GCC
4.       15,292     ATG        CAT
5.       41,224     CAC        GTG
6.       44,028     GCA        TGC
7.       46,962     AGG        CCT
8.       51,342     AGT        ACT
9.       52,497     TTA        TAA
10.      53,002     AAT        ATT
11.      53,027     TGG        CCA
12.      58,872     CAC        GTG
13.      59,766     CCT        AGG
14.      63,339     TGG        CCA
15.      89,301     TGT        ACA
16.      92,307     ATA        TAT
17.      93,558     TTC        GAA

                                                                          DrdI    BglI
Unique sites, per 40 kb (singlet).                                        (1.3)   (5.0)
Same last 2 bases of 3' overhang, per 40 kb (doublet).                    (2.1)   (9.2)
Palindromic overhang, not used.                                            2
Same last 2 bases of 3' overhang within BAC used exactly once (singlet).   3       1
Same last 2 bases of 3' overhang within BAC used exactly twice (doublet).  1       5
Same last 2 bases of 3' overhang within BAC used more than twice.          1       7

FIG. 12
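
One way to tally singlet, doublet, and more-than-twice end classes is sketched below, using the BglI overhang and complement triplets of FIG. 12 as read from the sheet. The counting convention (pool both columns, group by the last two printed bases, then count how many groups occur once, twice, or more) is an assumption and is not stated in the figure; the function name is hypothetical.

```python
from collections import Counter

# BglI triplets for clone RG364P16 as read from FIG. 12.
BGLI_OVERHANGS = ("GGC AAC GGC ATG CAC GCA AGG AGT TTA AAT "
                  "TGG CAC CCT TGG TGT ATA TTC").split()
BGLI_COMPLEMENTS = ("GCC GTT GCC CAT GTG TGC CCT ACT TAA ATT "
                    "CCA GTG AGG CCA ACA TAT GAA").split()


def tally_end_classes(overhangs, complements):
    """Group all ends by their last two bases and count the groups seen
    exactly once, exactly twice, and more than twice (assumed convention)."""
    counts = Counter(end[-2:] for end in list(overhangs) + list(complements))
    singlets = sum(1 for n in counts.values() if n == 1)
    doublets = sum(1 for n in counts.values() if n == 2)
    more_than_twice = sum(1 for n in counts.values() if n > 2)
    return singlets, doublets, more_than_twice


print(tally_end_classes(BGLI_OVERHANGS, BGLI_COMPLEMENTS))
```

Under this convention the 34 BglI ends fall into 1 singlet class, 5 doublet classes, and 7 classes used more than twice. The DrdI tallies on the same sheet may follow a different convention and are not modelled here.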






RG364P16, 7q31 Pendrin gene
14 SapI Sites in 97,943 bp

[Restriction map: positions of the 14 SapI sites along the clone, scale 10,000
to 90,000 bp.]

SapI #   Location   SapI Overhang   Ligated Complement
1.       2,731      CTA             TAG
2.       8,819      ATA             TAT
3.       27,714     CAG             CTG
4.       28,452     TCT             AGA
5.       37,174     GAA             TTC
6.       40,339     GTT             AAC
7.       44,149     CAC             GTG
8.       48,133     AAC             GTT
9.       49,746     CTT             AAG
10.      55,020     TTT             AAA
11.      56,593     CAG             CTG
12.      60,911     AGA             TCT
13.      76,747     TTA             TAA
14.      89,658     TGA             TCA

                                                                           SapI
Same last 2 bases of 3' overhang within BAC used exactly once (singlet).    7
Same last 2 bases of 3' overhang within BAC used exactly twice (doublet).   2
Same last 2 bases of 3' overhang within BAC used more than twice.           1

FIG. 13
GS056H18, 7q31 alpha2(I) collagen
11 DrdI and 15 BglI Sites in 116,466 bp



20000 40000 60000 80000 100000 
I 1 I I I L__ 

Bgl I 15 I i i 11 i Mil!!! 

DrdI 11 I II I I I i I I 



Drdl§ 


Location 


Overhang 
AA"* 


Complement 


1. 


7,281 


TT'" 


2. 


41.553 


AA" 


TT'" 


3. 


49.116 


TG* 


CA* 


4. 


61.875 


GT' 
AC" 


AC** 


5. 


69.731 


GT" 


6. 


76,744 


AG® 


CT® 


7. 


83,697 


GG® 


CC® 


8. 


95.410 


TA' 


TA* 


9. 


102,312 


TC" 


GA** 


10. 


107,014 


TC 


GA" 


11. 


114,581 


CA" 


TG* 





Location 


Overhang 


Complement 
CTG" 


1. 


26 


CAG'" 


2. 


12.014 


TTA* 


TAA** 
CAG*" 


3. 


27.316 


CTG*" 
AAA** 


4. 


37.513 


TTT® 


5. 


37.810 


GTA" 


TAG* 


6. 


52.919 


CTG'" 


CAG'" 


7. 


70.083 


ACA'" 


TGT*" 


8. 


72.753 


ACA'" 


TGT*" 


9. 


79,674 


CGA* 


TOG** 


10. 


85,304 


GCG'* 


CGC® 


11. 


88.200 


GTC'* 


GAG* 


12. 


95,350 


GAA* 


TTC** 


13. 


105.353 


ACA'" 


Tcr" 


14. 


111,096 


CCC" 


GGG® 


15. 


115,757 


TCC* 


GGA* 



Drdl Bg[[ 

Unique sites, per 40 kb (singlet). (1.4) (3.1) 

*Same last 2 bases of 3' overhang, per 40 kb (doublet). (2. 1 ) (7.2) 

^Palindromic overhang, not used. 1 

®Same last 2 bases of 3' oveihang within Bac used exactly once (singlet). 2 4 

'Same last 2 bases of 3' overhang within Bac used exactly twice (doublet). 4 7 

''Same last 2 bases of 3' overhang within BAC used more than twice. 0 3 



FIG. 14 
GS056H18, 7q31 alpha2(I) collagen
18 SapI Sites in 116,466 bp



20000 
I 



40000 
I 



60000 
I 



80000 
I 



100000 



Sap I 18 



I I I 



SapW 


Location 


Sapl 


Ligated 


1. 


676 


Overhang 
AAA 


Complement 

Tn* 

GAG" 


2; 


2.235 


CTC 


3. 


6.921 


CTG 


CAG" 


4. 


11.596 


ACC 


GOT' 


5. 


24.903 


GCT 


AGC* 


6. 


46.819 


AAA 


TTT" 


7. 


47.742 


TCC 


GGA» 


8. 


48.563 


ATT 


AAT® 


9. 


54.507 


TCT 


AGA* 


10. 


57.797 


ACT 


AGT* 


11. 


60,140 


TAG 


GTA® 


12. 


67.461 


AAG 


CTT" 


13. 


73.821 


AAT 


ATT" 


14. 


78.670 


CTG 


CAG" 


15. 


82,755 


CCT 


AGG® 


16. 


88,654 


AGT 


ACT* 


17. 


89.773 


GCA 


TGC» 


18. 


100,380 


CTC 


GAG" 



Sapl 

®Same last 2 bases of 3' overhang within B AC used exactly once(singlet). 4 
*Same last 2 bases of 3' overhang within BAC used exactly twice (doublet). 3 
^Same last 2 bases of 3* overhang within BAC used more than twice. 2 



FIG. 15 
Sequencing BglI islands in random BAC clones

1. Cut BAC DNA with MspI and BglI in the presence of linkers and T4 ligase.
Linker for BglI site is phosphorylated and ends in 3' NAC overhang. Linker for
MspI site is not phosphorylated, and contains a bubble. Biochemical selection
assures that most sites contain linkers.

2. Inactivate T4 ligase and restriction endonucleases at 95°C for 5 min. PCR
amplify using primers containing ribose U replacing dT, dNTPs, and Taq
polymerase. Add 0.1 N NaOH and heat to 95°C for 5 min to destroy unused
primers.

3. Neutralize and dilute. Anneal sequencing primer to the BglI site linker and
perform a cycle-sequencing reaction. (A separate reaction may be performed
using a primer annealing to the MspI site linker).

4. A separate linker ligation reaction is performed on the second half of the
BglI site using a phosphorylated primer which ends in a 3' NTA sequence.

[Diagram: BglI site 5' GCCNNNNNGGC 3' / 3' CGGNNNNNCCG 5'; linker ending in
3' NAC; ligated product 5' CTAATACNGGC 3' / 3' GATTATGNCCG 5'; amplified strand
5' CUAAUACNGGC 3' with ribose U positions marked for cleavage; sequencing
primer ending in 3' AC; second half of the BglI site 5' GCCNTAC 3' / 3' CGGN 5'
ligated to give 5' GCCNTACTTAG 3' / 3' CGGNATGAATC 5'.]

FIG. 16
Sequencing BglI islands in random BAC clones

1. Cut BAC DNA with MspI and BglI in the presence of linkers and T4 ligase.
Linker for BglI site is phosphorylated and ends in 3' ACN overhang. Linker for
MspI site is not phosphorylated, and contains a bubble. Biochemical selection
assures that most sites contain linkers.

2. Inactivate T4 ligase and restriction endonucleases at 95°C for 5 min. PCR
amplify using primers containing ribose U replacing dT, dNTPs, and Taq
polymerase. Primer specific to the BglI site linker will extend through bubble
of MspI site linker. This allows the primer specific to the MspI site linker
to amplify the BglI-MspI fragment. MspI-MspI fragments will not amplify since
they contain bubbles on both ends.

3. Add 0.1 N NaOH and heat to 95°C for 5 min to destroy unused primers.

4. Neutralize and dilute. Anneal sequencing primer to the BglI site linker and
perform a cycle-sequencing reaction. (A separate reaction may be performed
using a primer annealing to the MspI site linker).

[Diagram: BglI site 5' GCCNNNNNGGC 3' / 3' CGGNNNNNCCG 5' and MspI site
5' CCGG 3' / 3' GGCC 5'; BglI linker ending in 3' ACN; ligated BglI end with
the MspI bubble linker (CCGT / GGCA) at the other end; sequencing primer
ending in 3' AC.]

FIG. 16A
Sequencing SapI islands in random BAC clones

1. Cut BAC DNA with MspI and SapI in the presence of linkers and T4 ligase.
Linker for SapI site is phosphorylated and ends in 5' NAC overhang. Linker for
MspI site is not phosphorylated, and contains a bubble. Biochemical selection
assures that most sites contain linkers.

2. Inactivate T4 ligase and restriction endonucleases at 95°C for 5 min. PCR
amplify using primers containing ribose U replacing dT, dNTPs, and Taq
polymerase. Add 0.1 N NaOH and heat to 95°C for 5 min to destroy unused
primers.

3. Neutralize and dilute. Anneal sequencing primer to the SapI site linker and
perform a cycle-sequencing reaction. (A separate reaction may be performed
using a primer annealing to the MspI site linker).

4. A separate linker ligation reaction is performed on the second half of the
SapI site using a phosphorylated primer which ends in a 3' NTA sequence.
However, this reforms the SapI site, and thus the linker is cleaved off
preventing substantial DNA amplification.

[Diagram: SapI site 5' GCTCTTCNNNN 3' / 3' CGAGAAGNNNN 5' and MspI site
5' CCGG 3' / 3' GGCC 5'; ligated SapI end 5' CTAATACAATG 3' /
3' GATTATGTTAC 5' with the MspI bubble linker (CCGT / GGCA) at the other end;
amplified strand 5' CUAAUACAAUG 3' with ribose U positions marked for cleavage;
second half of the SapI site 5' GCTCTTCN 3' / 3' CGAGAAGNTAC 5', which after
ligation (5' GCTCTTCNATG 3' / 3' CGAGAAGNTAC 5') is cleaved again by SapI.]

FIG. 17
[Plots: "Probability of Two or more Singlets or Doublets in BAC" and
"Probability of Two or more Singlets in BAC", each plotted against the # of RE
sites in BAC.]

FIG. 17A
Alignment of BAC sequences generated from DrdI sites:

1 . tcgtcctcaggaactgaagctatataatcagttaagtccctgcttctgatctcttc 

2 . gtgtcaagtaaagaagtacagcagataagtaaaacggaaaaaaataatgaaagaattacaaaggaagactaacx;aaaga^ 

3 . aagtctacaatcaagaggccaactgattccatgtctggtgagggtctatttcct^ 

4 . TAGTCCTCAATTTCACCATGGATTAAATAACAGAACACAGAGTTACTGTGAGACTTGT(^ 

5 . GTGTCATCTAGCTATAAATCTAAAGATAATAATAAAATTGGAAAGATTTTCATCAGATAGACTTTO 

Concordant sequences: Doublet to singlet. 

1 . TCGTCCTCAGGAACTGAAGCTATATAATCAGTTAAGTCCCTGCTTCTGATCTCTTCTG^ 
2 . GTGTCAAGTAAAGAAGTACAGCAGATAAGTAAAACGGAAAAAAATAATGAAAGAATTACAAAGG^^ 



'TU L.J J T TJ T T LTTJ. 



Concordant sequences: Doublet to Doublet. 

1 . TCGTCCTCAGGAACTGAAGCTATATAATCAGTTAAGTCCCTGCITCTGATC 

2 . GTGTCAAGTAAAGAAGTACAGCAGATAAGTAAAACGGAAAAAAATAATGAAAGAATTACAAAGGAA 



nH.^lTlTTl. . J.^LTlTi L I Lii^ll ^ 1 TTTI^i 



^ • AAGTCTACAATCAAGAGGCCAACTGATTCCATGTCTGGTGAGGGTCTATTT^ 

2 . gtgtcaagtaaagaagtacagcagataagtaaaacggaaaaaaataatgaaag;^ 

Concordant sequences: Doublet to Triplet. 

1 . tcgtcctcaggaactgaagctatataatcagttaagtccci^tt 

2 . gtctcaagtaaagaagtacagcagataagtaaaacggaaaaaaataatgaaagaattacaaag^ 

in .J . T . J. . L" L T-^ T .11 , TTT . 

3 . aagtctacaatcaagaggccaactgattccatgtctggtgagggtctatt^ 
2 . gtctcaagtaaagaagtacagcagataagtaaaacggaa 

4 . tagtcctcaatttcaccatggattaaataacagaacacagagttactgtgagacttc 

Discordant sequences: Doublet to singlet. 

1 . tcgtcctcagg^actgaagctatataatcagttaagtccctgcttctgatc 

2 . gtgtcaagtaaagaagtacagcagataagtaaaacggaaaaaaataatgaaagaato^ 

mx XX XXX X XXX X X XX X XXX XXj XX XX xxT XX xxxxxxxxxjx 

3 . AAGTCTACAATCAAGAGGCCAACTGATTCCATGTCTGGTGAGGGTCTATTTCCTGGrc 

Discordant sequences: Doublet to Doublet. 

1 . TCGTCCTCAGGAACTGAAGCTATATAATCAGTTAAGTCCCTGCTTCTO 

2 . GTGTCAAGTAAAGAAGTACAGCAGATAAGTAAAACGGAAAAAAATAATGAAAGAA^^ 

JH XXXx X XXXx""Xx" X XX 5ocx" 

3 . AAGTCTACMICAAGAGGCCAACTGATTCCATGTCTGGTGAGGGTCTATT^ 
4 . TAGTCCTCAATTTCACCATGGATTAAATAACAGAACACAG^ 

Discordant sequences: Doublet to Triplet. 

1 . TCGTCCTCAGGAACTGAAGCTATATAATCAGTTAAGTCCCTGCTTCTGATCTCTT 

2 . GTGTCAAGTAAAGAAGTACAGCAGATAAGTAAAACGGAAAAAAATAATGAAAGA^ 

IHXX XX X xxXXXx 

3 . AAGTCTACAATCAAGAGGCCAACTCATTCCATGTC 

4 . TAGTCCTCAAJTTCACCATGGATTAAATAACA^ 

5 . GTGTCATCTAGCTATAAATCTAAAGATAATAATAAAATTGGAAAGATTTTCATCAGATAGACTTTTAA^^ 
DrdI/MseI Fragments in approximately 2 MB of human DNA

(BACs analyzed: RG253B13, RG013N12, RG300C03, RG022J17, RG067E13, RG011J21,
RG022C01, RG043K06, RG343P13, RG205G13, O68P20, H_133K23, RG363E19, RG364P16,
GS056H18, RG083J23, RG103H13, and RG118D07)



For AA overhangs (30 Fragments) 



Drdltt 


Location 


Overhang 


Complement 


Nearest 


Fragment 


9. 








Msel 


Length 


101,440 




AA*(T) 
AA* 


100753 


687 


8. 


125,589 




124941 


648 


8. 


65,737 


AA*(C) 




66359 


622 


2. 


41,548 


AA*(C) 
AA* 




41918 


370 


3. 


21,755 




22080 


325 


11. 


148,484 


AA* 




148770 


286 


15. 


180,054 




AA* 


179781 


273 


1. 


7,287 


AA*(A) 




7551 


264 


4. 


64,195 




AA* 


63964 


231 


2. 


16192 




AA* 


16002 


190 


5. 


19,520 




AA* 


19354 


166 


7. 


1 12,864 




AA* 


112716 


148 


9. 


67,981 


AA*(A) 




68102 


121 


10. 


76,325 


AA*(C) 




76443 


118 


6. 


73,322 


AA* 




73424 


102 


10. 


158,579 




AA* 


158499 


80 


1. 


9,941 




AA*(C) 


9867 


74 


8. 


65.625 




AA* 


65554 


71 


6. 


45,326 




AA* 


45263 


63 


14. 


168,400 




AA* 


168352 


48 


7. 


39,958 


AA*(C) 




40005 


47 


2. 


27,073 




AA*(A) 


27027 


46 


8. 


144,712 


AA*(A) 
AA* 


144750 


38 


3. 


30,987 




31013 


26 


10. 


1 14962 


AA* 




1 14986 


24 


4. 


89309 




AA* 


89290 


19 


1. 


4518 


AA* 




4532 


14 


11. 


137,177 




AA*(A) 
AA* 


137176 


1 


12. 


165,140 




165139 


1 


9. 


86,690 




AA* 


86689 


1 



For AC overhangs ( 14 Fragments) 

DrdlU Location Overhang Complement Nearest Fragment 

Msel Length 

4. 61,881 AC* 61424 457 

5. 70,306 AC* 69996 400 



Fia 19 
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5. 


51333 


AC* 


2. 


17,346 




2. 


26,871 




2. 


16,508 


AC* 


4. 


45929 


AC* 


6. 


104,064 




8. 


80.512 




9. 


113,009 




6. 


100,564 




5. 


69,737 


AC* 


10. 


1 13,048 


AC* 


5. 


89,050 


AC* 



For AG overhangs ( 1 8 Fragments) 
DrdlM Location Overhang 



7. 


124,720 




8. 


99,628 




7. 


55.076 




11. 


146,074 


AG* 


3. 


63,332 


AG* 


2. 


1,484 




1. 


30,506 




4. 


51345 


AG* 


12. 


159,685 


AG* 


3. 


1.703 




5. 


26,574 




9. 


125.495 


AG* 


9. 


84,646 




6. 


76,750 


AG* 


11. 


137111 




5. 


71871 


AG* 


4. 


18,683 


AG* 


2. 


27,400 


AG* 





51712 


379 


AC* 


17135 


211 


AC* 


26668 


203 




16703 


195 




46051 


132 


AC* 


103955 


109 


AC* 


80423 


89 


AC* 


112938 


71 


AC* 


100500 


64 




69789 


52 




113095 


47 




89180 


30 


Complement 


Nearest 


Fragment 




Msel 


Length 


AG* 


123644 


1076 


AG* 


99513 


546 


AG* 


54728 


348 




146412 


338 




63546 


214 


AG* 


1273 


211 


AG* 


30700 


194 




51500 


155 




159827 


142 


AG* 


1593 


110 


AG* 


26478 


96 




125587 


92 


AG* 


84587 


59 




76794 


44 


AG* 


137072 


39 




71907 


36 




18707 


24 




27409 


9 



For CA overhangs ( 28 Fragments) 
DrtM Location 



1. 
5. 
8. 
4. 
7. 
7. 



11,050 

40,727 

92,729 

28263 

96,506 

68476 



Overhang 



CA*(G) 



CA*(A) 
CA* 



Complement 

CA*(T) 

CA*(G) 
CA* 



Nearest 

Msel 

10453 

41277 

92225 

27859 

96800 

68753 



Fragment 

Length 

597 

550 

504 

404 

294 

277 



FIG. 19(cont.) 
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3. 


40,167 




CA*(T) 


39891 


276 


7. 


104,893 


CA*(G) 


105141 


248 


12. 


174,759 




CA*(G) 


174553 


206 


3. 


24,762 


CA* 


24967 


205 


7. 


78,864 




CA*(T) 


78672 


192 


3. 


27,738 


CA*(A) 


27922 


184 


11. 


1 14,587 


CA*(G) 




1 14739. 


152 


4. 


25.393 


CA*(G) 




25529 


136 


1. 


1797 


CA*(T) 


1663 


134 


7. 


56,328 




CA*(A) 


56194 


134 


5. 


47,359 




CA*(T) 


47234 


125 


3. 


49,122 




CA*(G) 


48998 


124 


11. 


92,418 


CA*(T) 


92512 


94 


7. 


142,867 




CA*(G) 


142773 


94 


12. 


98,198 


CA*(A) 


98284 


86 


6. 


60,501 


CA*(T) 


60424 


77 


8. 


83,536 


CA*(A) 




83598 


62 


6. 


77,518 


CA* 




77578 


60 


7. 


41,602 


CA*(T) 




41644 


42 


9. 


149,703 


CA*(A) 




149735 


32 


10. 


128,190 


CA*(G) 


128168 


22 


5. 


40,370 




CA*(G) 


40357 


13 



For GA overhangs (15 Fragments) 

DrdlU Location Overhang Complement 



10. 

10. 

8. 

9. 

6. 



138,792 
107,020 
105,928 
132,110 
25,229 



GA* 
GA* 
GA* 



GA* 
GA* 



Nearest 

Msel 

138206 

106698 

105714 

132317 

25384 



Fragment 

Length 

586 

322 

214 

207 

155 









Figure 19 (cont.) 
GA* 4225 




1. 


4,328 




103 


4. 


29,833 


GA* 




29929 


96 


13. 


166,309 


GA* 




166386 


77 


4. 


66,836 




GA* 


66763 


73 


8. 


139,856 




GA* 


139797 


59 


9. 


102.318 




GA* 


102277 


41 


5. 


97330 




GA* 


97292 


38 


6. 


91.681 




GA* 


91645 


36 


11. 


153.548 


GA* 




153569 


21 


14. 


169.979 


GA* 




169996 


17 



FIG. 19 (cont.) 
For GG overhangs (14 Fragments)

DrdI #   Location   Overhang /    Nearest   Fragment
                    Complement    MseI      Length
3.       33,306     GG*           34241     935
3.       43,961     GG*           44471     510
2.       41,448     GG*           41745     297
7.       83,703     GG*           83957     254
13.      180,666    GG*           180498    168
2.       19,383     GG*           19227     156
10.      137,833    GG*           137722    111
5.       89,627     GG*           89570     57
9.       129,058    GG*           129003    55
9.       74,360     GG*           74409     49
12.      154,063    GG*           154021    42
1.       5,385      GG*           5417      32
1.       626        GG            596       30
6.       49,989     GG*           50001     12

FIG. 19 (cont.)






WO 00/40755

PCT/US00/00144






DrdI/MspI/TaqI Fragments in approximately 2 MB of human DNA

(RG253B13, RG013N12, RG300C03, RG022J17, RG067E13, RG011J21, RG022C01,
RG043K06, RG343P13, RG205G13, O68P20, H_133K23, RG363E19, RG364P16, GS056H18,
RG083J23, RG103H13, and RG118D07)



For AA overhangs (28 Fragments) 



DrdI #   Location   Overhang   Complement   Nearest MspI   Nearest TaqI   Fragment Length


14. 


168,400 




AA* 


Drdl( 157,688) 162,381 


6,019 


10. 


158,579 




AA* 


151,605 


153,001 


5,578 


2. 


41,548 


AA*(C) 






46.609 


5,061 


1. 


9,941 




AA*(C) 


296 


6,494 


3,447 


7. 


39,958 


AA*(C) 




43,295 


45,578 


3,337 


7. 


1 12,864 




AA* 


1 10.256 


Drdlf 104,064) 2,608 


10. 


1 14962 


AA* 




1 17286 


120674 


2324 


9. 


86.690 




AA* 


82,301 


84.647 


2,043 


3. 


21,755 


AA* 




27,904 


23.795 


2,040 


9. 


67,981 


AA*(A) 




71.232 


69.660 


1,679 


10. 


76,325 


AA*(C) 




79,607 


77,651 


1,326 


8. 


65,625 




AA* 


63,673 . 


64.515 


1,110 


1. 


4518 


AA* 




5549 


5792 


1031 


4. 


89309 




AA* 


88376 


86730 


933 


11. 


137,177 




AA*(A) 


135,890 


136,580 


597 


3. 


30,987 


AA* 


31,504 


Drdl(32,405) 


517 


15. 


180,054 




AA* 


179562 


176427 


492 


8. 


125,589 




AA* 


Drdl(124,720)125.163 


426 


5. 


73,322 


AA* 




75,251 


73.738 


416 


8. 


65,737 


AA*(C) 




66.175 


66.077 


340 


1. 


7,287 


AA*(A) 




8,799 


7,614 


327 


2. 


16192 




AA* 


15865 


15964 


228 


2. 


27,073 




AA*(A) 


25,402 


26,872 


201 


9. 


101,440 




AA*(T) 
AA* 




101,248 


192 


6. 


45,326 




45,207 


43,098 


119 


8. 


144,712 


AA*(A) 




145.939 


144,809 


97 


12. 


165,140 




AA* 


165069 


158079 


71 


11. 


148,484 


AA* 




148,536 




52 



For AC overhangs (14 Fragments)

DrdI #   Location   Overhang/Complement   Nearest MspI   Nearest TaqI   Fragment Length

 9.      113,009    AC*                   109,696        111,008        2,001
 6.      100,564    AC*                    99,222         99,117        1,342
 5.       70,306    AC*                    69,207         67,458        1,099
 2.       16,508    AC*                    17,607         20,496        1,099



FIG. 20 
 4.        45929    AC*                     46933          49057         1004
 5.       69,737    AC*                    72,665         70,593          856
 5.       89,050    AC*                    93,107         89,749          699
 6.      104,064    AC*                   103501         103223           563
 2.       17,346    AC*                    16,821         14,081          525
 2.       26,871    AC*                    26,363         21,540          508
 8.       80,512    AC*                    78,243         80,116          396
10.      113,048    AC*                   122,429        113,429          381
 5.        51333    AC*                     54102          51541          208
 4.       61,881    AC*                    61,786         60,430           95


For AG overhangs (12 Fragments)

DrdI #   Location   Overhang   Complement   Nearest MspI    Nearest TaqI   Fragment Length

 4.        51345    AG*                      57329           59409          5984
 7.       55,076               AG*           51,621          53,820         1,256
11.      146,074    AG*                     147289          149991          1215
11.       137111               AG*          135970          133640          1141
 5.       26,574               AG*           25,682                          892
 9.       84,646               AG*          DrdI(83,536)     83,821          825
 5.        71871    AG*                      73210           72675           804
 6.       76,750    AG*                      77,964          77,104          354
12.      159,685    AG*                     160,038         161,212          353
 1.       30,506               AG*           30,330          30,080          176
 7.      124,720               AG*          124,563         123,299          157
 8.       99,628               AG*           99513           99,370          115
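
The Fragment Length values in the AG-overhang table agree with the distance from each DrdI site to the closer of the listed MspI and TaqI sites. The following is a minimal Python sketch of that arithmetic. It assumes (the figure does not state this) that the length is the plain coordinate difference, and the function name `fragment_length` is hypothetical.

```python
from typing import Optional


def fragment_length(drdi: int, mspi: Optional[int], taqi: Optional[int]) -> int:
    """Distance from a DrdI site to the nearer of the listed MspI/TaqI sites.

    Assumption: the length is the plain coordinate difference. A site that
    is missing from the table is passed as None.
    """
    candidates = [site for site in (mspi, taqi) if site is not None]
    if not candidates:
        raise ValueError("at least one neighbouring site is required")
    return min(abs(drdi - site) for site in candidates)


# Rows from the AG-overhang table: (DrdI, MspI, TaqI, listed length)
rows = [
    (51345, 57329, 59409, 5984),
    (55076, 51621, 53820, 1256),
    (26574, 25682, None, 892),
    (99628, 99513, 99370, 115),
]
for drdi, mspi, taqi, listed in rows:
    print(drdi, fragment_length(drdi, mspi, taqi), listed)
```

For these four rows the computed value equals the listed fragment length. Rows that name a neighbouring DrdI site instead of an MspI or TaqI site would need that coordinate passed in the same way.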



For CA overhangs (25 Fragments) 



DrcRM Location 


Overhang 


Complement 


Nearest 


Nearest 


Fragment 




92,418 






Mspl 


Taq\ 


Length 


11. 


CA*(T) 




97.628 


97.710 


5,210 


10. 


128,190 




CA*(G) 


111,800 


125,432 


2.758 


8. 


92,729 




CA*(G) 


90.558 


90,541 


2.171 


5. 


40,727 


CA*(G) 




42.854 


43,404 


2.127 


7. 


41,602 


CA*(T) 




50.849 


43,487 


1.885 


11. 


114,587 


CA*(G) 




116.105 


1 16,257 


1,518 


5. 


47,359 




CA*(T) 


41.626 


45,860 


1.499 


7. 


56,328 




CA*(A) 


52.005 


55.150 


1,178 


12. 


174,759 




CA*(G) 


171,992 


173,598 


1,161 


3. 


49,122 




CA*(G) 




48.199 


923 


1. 


11,050 




CA*(T) 


10,189 


8.861 


861 



FIG. 20(cont.) 
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7. 


78.864 




CA*(T) 




78,112 


752 


7. 


96.506 


CA*(A) 




98,602 


97,059 


559 


7. 


142,867 




CA*(G) 
CA* 


135,955 


142,371 


496 


4. 


28.263 




27.904 


23,795 


359 


12. 


98,198 


CA*(A) 




98,497 


98,862 


299 


4. 


25.393 


CA*(G) 




25,682 




289 


8. 


83,536 


CA*(A) 




Drdl(84,646) 


83,821 


285 


7. 


104,893 


CA*(G) 




105,128 


105,920 


235 


5. 


40,370 




CA*(G) 


Drdl(32,405) 


40,215 


155 


6. 


60,501 




CA*(T) 


57,989 


60,462 


39 


7. 


68476 


CA* 


70850 


68488 


8 


3. 


27,738 


CA*(A) 
CA* 




30,751 


27,742 


4 


6. 


77,518 






77522 


4 


9. 


149,703 


CA*(A) 




151,530 


149.707 


4 



For GA overhangs (15 Fragments)

DrdI #   Location   Overhang   Complement   Nearest MspI   Nearest TaqI     Fragment Length

 6.       25,229    GA*                      31,564         30,045          4,816
14.      169,979    GA*                     179562         174481           4502
 6.       91,681               GA*           88,256         81,884          3,419
 5.        97330               GA*           94353          89615           2977
 4.       29,833    GA*                      41,626         31,251          1,418
 4.       66,836               GA*           65,504         62,654          1,332
13.      166,309    GA*                     167668         166451           1311
 9.      132,110    GA*                     133,806        132,976           866
 8.      139,856               GA*          139,346        139,218           510
11.      153,548    GA*                     153,789        160,722           241
 4.       42,388    GA*                      42,584        DrdI (42,586)    (196)
 9.      102,318               GA*           98,975        102,155           163
10.      107,020               GA*          106,882        105,288           138
10.      138,792               GA*          137757         138715             77
 8.      105,928               GA*          105,592        105,920             8



For GG overhangs (12 Fragments)

DrdI #   Location   Overhang/Complement   Nearest MspI   Nearest TaqI   Fragment Length

 3.       33,306    GG*                    38,218         40,389        4,918
 7.       83,703    GG*                    87,372         90,806        3,669
12.      154,063    GG*                   142,944        150,402        3,661
 2.       19,383    GG*                    13,868         17,667        1,710
 6.       49,989    GG*                    51,421         51,451        1,432
 9.       74,360    GG*                    75,697         75,962        1,337



FIG. 20 (cont.) 
 1.        5,385    GG*                     6,381          6,249          864
13.      180,666    GG*                   179,917        177,380          749
 3.       43,961    GG*                    48,573         44,652          691
 2.       41,448    GG*                    42,084         42,010          562
10.      137,833    GG*                   137,329        136,062          504
 5.       89,627    GG*                    80,801         89,331          294



FIG. 20(cont) 
Determining four unique singlet DrdI sequences from two overlapping doublet BAC sequences.

Concordant sequences : Doublet to Doublet. 

1 . TCGTCCTCAGGAACTGAAGCTATATAATCAGTTAAGTCCCTGCTTC 

2 . GTGTCAAGTAAAGAAGTACAGCAGATAAGTAAAACGGAAAAAAATAATC^ 

d dSSS dsiisdsisdsdssididddisiidsidssssdddsiddSiiddiddiiddSississdidsddddsdssdSdis 

3 . AAGTCTACAATCAAGAGGCCAACTGATTCCATGTCTGGTGAGGGTCTATO 

2 , GTGTCAAGTAAAGAAGTACAGCAGATAAGTAAAACGGAAAAAAATAATGAAAGAATTACA^ 

From above 2 BACs, sequence #2 is: 

CAA CA-ATCT GCTTCTGTT T 

2 ? GTGTCAAGTAAAGAAGTACAGCAGATAAGTAAAACGGAAAAAAATAATGAAAGAATTACAAAGGAAGACTAAGGAAAGA^ 

Concordant sequences : Doublet to Doublet , 

3 . AAGTCTACAATCAAGAGGCCAACTGATTCCATGTCTGGTGAGGGT^ 

2 . GTGTCAAGTAAAGAAGTACAGCAGATAAGTAAAACGGAAAAAAAT
dsSSSdsssSsddsiddisdisdsisisddsisisdsdisSsddsssdsdiddddisdsssSSsiisdiidsddsdsdss 

3 . AAGTCT ACAATC AAGAGGCCAACTGATTCCATGTCTGGTGAGGGTCTATTTCCTGGTGC ATAGA TGGC 

4 . TAGTCCTCAATTTCACCATGGATTAAATAACAGAAC

From above 2 BACs, sequence #3 is : 

3 ? AAGTCTACAATCAAGAGGCCAACTGATTCCATGTCTGGTGAGGGTCTATTTCCTGGTGCATAGATC

AAGAAAAA AT AACT 

CAA CAATCT GCT TCTGTT T 

2 ? GTGTCAAGTAAAGAAGTACAGCAGATAAGTAAAACGGAAAAAAATAATGAAAGAATTACAAAGGAAGACTAAGG^ 

By comparing the consensus sequence between 2 and 3, one can determine the overlap. 
In this case, only two positions are indeterminate (A or T) . Hence 2 and 3 are: 

T T 

2 = GTGTCAAGTAAAGAAGTACAGCAGATAAGTAAAACGGAAAAAAATAATGAAAGAATTACAAAGGAAGACT AAGG AAAG 

A A 

3 = AAGTCTACAATCAAGAGGCCAACTGATTCCATGTCTGGTGAGGGTCTATTTCCTGGTGC

and by subtraction, one can determine 1 and 4: 

A A 
1 = TCGTCCTCAGGAACTGAAGCTATATAATC AGTTAAGTCCCTGCTTCTGATCTCTTCTGATTTTC 

T . T 

4 = TAGTCCTCAATTTCACCATGGATTAAATAACAGAAC ACAGAGTTACTGTGAGACTTGTGGTAGAAAATC^^ 

FIG. 21 
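
The overlap-and-subtract reasoning of FIG. 21 can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the procedure actually used for the figure. A doublet read is modelled as a per-position set of the bases seen, the helper names `mix`, `shared`, and `subtract` are hypothetical, and the test data are only the first ten bases of sequences 1, 2, and 3 as printed above.

```python
def mix(seq_a, seq_b):
    """Model a doublet read: at each position, the set of bases from both sequences."""
    return [frozenset((a, b)) for a, b in zip(seq_a, seq_b)]


def shared(read_x, read_y):
    """Intersect two doublet reads position by position.

    The result holds the bases compatible with the sequence common to both
    reads. A position with more than one base is indeterminate.
    """
    return [x & y for x, y in zip(read_x, read_y)]


def subtract(read, known):
    """Remove a known sequence from a doublet read and return the other sequence."""
    other = []
    for bases, k in zip(read, known):
        if k not in bases:
            raise ValueError("known sequence is not part of this read")
        rest = bases - {k}
        # If both sequences carry the same base, the read shows one base only.
        other.append(next(iter(rest)) if rest else k)
    return "".join(other)


# First ten bases of sequences 1, 2, and 3 from the figure (toy-sized input).
s1, s2, s3 = "TCGTCCTCAG", "GTGTCAAGTA", "AAGTCTACAA"
bac_a = mix(s1, s2)   # doublet read containing sequences 1 and 2
bac_b = mix(s2, s3)   # doublet read containing sequences 2 and 3

consensus = shared(bac_a, bac_b)
ambiguous = [i for i, bases in enumerate(consensus) if len(bases) > 1]
print(ambiguous)               # 0-based positions where sequence 2 stays indeterminate
print(subtract(bac_a, s2))     # recovers sequence 1 once sequence 2 is known
```

On this ten-base input two positions remain indeterminate after the intersection, because sequences 1 and 3 happen to share a base there. Once sequence 2 is fixed, subtraction returns sequence 1 unchanged.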
Determining three unique singlet DrdI sequences from overlapping doublet and triplet BAC
sequences.

Concordant sequences: Doublet to Doublet. 

1 . TCGTCCTCAGGAACTGAAGCTATATAATCAGTTAAGTCCCTGCTTC 

2 . GTGTCAAGTAAAGAAGTACAGCAGATAAGTAAAACGGAAAAAAATAAl^^ 

ddSSSdsiisdsisdsdssididddisiidsidssssdddsiddSiiddiddiiddSississdidsddddsdssdSdis 

3 . AAGTCTACAATCAAGAGGCCAACTGATTCCATGTCTGGTGAGGGTC

2 . GTGTCAAGTAAAGAAGTACAGCAGATAAGTAAAACGGAAAAAAATAATC 

From above 2 BACs, sequence #2 is: 

CAA CAATCT GCTTCTGTT T 

2 ? GTGTCAAGTAAAGAAGTACAGCAGATAAGTAAAACGGAAAAAAATAATGAAAGAATTACAAAGGAAGACTAAGGAAAGAG 

Concordant sequences: Doublet to Triplet. 

3 . AAGTCTACAATCAAGAGGCCAACTGATTCCATGTCTGGTGAGGGTC 

2 . GTGTCAAGTAAAGAAGTACAGCAGATAAGTAAAACGGAAAAAAA 

iiSSSdssiSdddsiddisdisidisiidisidisdsdissiiisisisdiiddiisdssisssiisiiiddddddsdds 

3 . AAGTCTACAATCAAGAGGCCAACTGATTCCATG^^ 
4 . TAGTCCTCAMTTCACCATGGATTAjy^ 

5 . GTGTCATCTAGCTATAAATCTAAAGATAATAATAAAATTGGAAAGAriT 
From above 2 BACs, sequence #3 is: 



3 ? AAGTCTACAATCAAGAGGCCAACTGATTCCATGTCTGGTGAGGGTCTATTTCCTGGTGCATAGAT^ 
GT T AAGAAAATAA A AAA A T AA AT A AA ACT 

CAA CAATCT GCTTCTGTT T 

2 ? GTGTCAAGTAAAGAAGTACAGCAGATAAGTAAAACGGAAAAAAATAATGAAAGAATTACAAAGGAA^ 

By comparing the consensus sequence between 2 and 3, one can determine the overlap.
In this case, only two positions are indeterminate (A or T). Hence 2 and 3 are:

A T T G C T T 

2 = GTGTCAAGTAAAGAAGTACAGCAGATAAGTAAAACGGAAAAAAATAATGAAAGAATTACAAAG 

T A A A A A A 

3 = AAGTCTACAATCAAGAGGCCAACTGATTCC ATGTCTGGTGAGGGTCTATTTCCTGGTGCAT^^ 



and by subtraction, one can determine 1 is: 



T A A A A A A 

1 = TCGTCCTCAGGAACTGAAGCTATATAATCAGTTAAGTCCCTGCTTCTGATCTCTT^



From the above data, one cannot determine sequences 4 and 5, although one can reduce them
to a doublet sequence by subtracting sequence 3. The alignment of this triplet BAC
with another singlet or doublet from the neighboring BAC on the other side (i.e., 5
alone or a 5 and 6 doublet) will allow one to decipher sequences 4, 5, and 6.

FIG. 22 
BglI, DrdI, and SapI sites in the pBeloBAC11 cloning vector.

[Restriction map of pBeloBAC11, scale 1000 to 7000 bp: BglI (4 sites), DrdI (4 sites), SapI (2 sites).]



BgM Location 



1. 
2. 
3. 
4. 



155 
634 
2,533 
6.982 



Nearby 

Site 

Fspl 



Overhang 

(BgU) 

TTC 

CCC 

TGT 

TGC 



(Bgn.) 

GAA 
GGG 
ACA 
OCA 



Site 
Narl 



Overlapping Complement Nearby 
Site 

Xmal 
Stul 

NgoMTW 



DrtM Location 



1. 
2. 
3. 
4. 



1,704 
2,616 
3,511 
4,807 



Nearby 
Site 

A/wNI 



Overhang 
(Drrfl) 

AA 
TC 
GA 
TG 



Overlapping Complement Nearby 
Site {Drd\) Site 



Bspm 



TT 
GA 
TC 
CA 



SapW Location 



1. 
2. 



3,964 
5,174 



Nearby 
Site 

Oral 



Overhang 
.{SapD 

TAT 
ACT 



Overlapping Complement Nearby 
Site {Sapl) Site 



ATA 
AGT 



BcR 



FIG. 23 
BglI, DrdI, and SapI sites in the pUC19 cloning vector.

[Restriction map of pUC19, scale 250 to 2500 bp: BglI (2 sites), DrdI (2 sites).]



BglI #   Location   Nearby Site   Overhang (BglI)   Overlapping Site   Complement (BglI)   Nearby Site

1.       429        NarI          GAA                                  TTC                 FspI
2.       1,547                    TTC               MspI               GAA



DrdW Location 



1. 582 

2. 2,450 



Nearby 
Site 



Overhang 
(Drdi) 

GC 
GA 



Overlapping Complement Nearby 
Site (Drdl) Site 



Taql 



GC 
TC 



Sapl sites: None 



FIG. 24 
Sequencing BamHI islands in random BAC clones



BamHl 



BamHl 



1. Cut BAC DNA with BamHI in the presence of linkers and T4 ligase.
The linker for the BamHI site is not phosphorylated.
Biochemical selection assures that most sites contain linkers.



I 

•GGATCC- 
• CCTAGG • 
t 



4 

■ GGATCC- 

■ CCTAGG- 

1 



2. Inactivate T4 ligase and BamHI endonuclease at 65°C for 10 min, melt off
unligated linker strand. Add Taq polymerase and dNTPs and fill in 3' ends.
PCR amplify using primers containing ribose U replacing dT, dNTPs, and
Taq polymerase. Add 0.1 N NaOH and heat to 95°C for 5 min to destroy
unused primers.



5* . 

5' . 

3' 



. AGATCC ■ 
• TCTAGG • 



• GGATCT- 
• CCTAGA- 

a!- 



. AGATCC ' 
• TCTAGG " 



GGATCT- 
CCTAGA- 



AA LA 



3. Neutralize and dilute. Anneal sequencing primer which extends past the
BamHI site linker by two bases and perform a cycle-sequencing reaction.
(Separate reactions are performed using primers containing other two base
extensions).

5' AGATCC     GGATCT 3'
3' TCTAGG     CCTAGA 5'



FIG. 25 
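
Step 3 calls for separate sequencing reactions with primers that differ only in their two-base 3' extension. As a small illustration, the 16 possible extensions can be enumerated as below. The base primer sequence is a made-up placeholder (only its AGATCC ending echoes the diagram), and `extension_primers` is a hypothetical helper name.

```python
from itertools import product
from typing import List


def extension_primers(base_primer: str) -> List[str]:
    """Return the base primer followed by each of the 16 two-base 3' extensions."""
    return [base_primer + "".join(pair) for pair in product("ACGT", repeat=2)]


# Hypothetical linker primer; the real linker sequence is not given in the figure.
primers = extension_primers("CGACTGAGATCC")
for primer in primers:
    print(primer)
```

Each primer would be used in its own reaction, so that only fragments whose first two bases past the site match the extension are read.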
EcoRI, HindIII, and BamHI site frequencies in a sequenced BAC from 7q31.

RG253B13, 7q31 Met Oncogene
19 BamHI Sites in 171,905 bp

[Restriction map: BamHI (19 sites), EcoRI (49 sites), and HindIII (64 sites) along the BAC; scale 25,000 to 150,000 bp.]





Enzyme      Freq   Position(s)

BamH I      19   : 39474   53874   53955   58547
                 : 61411   63629   74716   82491
G GATC C         : 86169   97907   100558  120206
C CTAG G         : 132953  156707  159016  165913
                 : 169171  170414  170908



Number of fragments 4 kb or less: 9 
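
The count of short fragments can be checked directly from the listed BamHI positions. The sketch below assumes a strict 4,000 bp cutoff between adjacent sites, which the figure does not define precisely. The function name `short_fragments` is hypothetical.

```python
from typing import List, Tuple


def short_fragments(positions: List[int], max_len: int = 4000) -> List[Tuple[int, int, int]]:
    """Return (left site, right site, length) for adjacent sites at most max_len apart."""
    sites = sorted(positions)
    return [(a, b, b - a) for a, b in zip(sites, sites[1:]) if b - a <= max_len]


# BamHI positions in RG253B13 as listed in FIG. 26.
bamhi_sites = [
    39474, 53874, 53955, 58547, 61411, 63629, 74716, 82491, 86169, 97907,
    100558, 120206, 132953, 156707, 159016, 165913, 169171, 170414, 170908,
]

fragments = short_fragments(bamhi_sites)
for left, right, length in fragments:
    print(left, right, length)
print(len(fragments))
```

With this cutoff the sketch finds nine adjacent site pairs, starting with 53,874 to 53,955, which agrees with the count given in the figure.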



BamHI #   Location #1   Location #2

1.         53,874        53,955
2.         58,547        61,411
3.         61,411        63,629
4.         82,491        86,169
5.         97,907       100,558
6.        156,707       159,016
7.        165,913       169,171
8.        169,171       170,414
9.        170,414       170,908



+ 2 bases 

AT" 

TA® 

AC® 
CA® 

TC* 
CT* 



Complement + 2 bases 

AA® 
AT'' 
CT* 

AG® 
AT'' 
TC* 



Clusters: (2, 3); (7, 8, 9)

BamHI

®Same + 2 bases next to site within BAC used exactly once (singlet). 6
*Same + 2 bases next to site within BAC used exactly twice (doublet). 2
"Same + 2 bases next to site within BAC used more than twice. 2



FIG. 26 
[Restriction map (continued): BamHI (19 sites), EcoRI (49 sites), and HindIII (64 sites); scale 25,000 to 150,000 bp.]



EcoR I      49   : 2446    4350    6140    6158
                 : 6225    10073   12053   12399
G AATT C         : 15083   28087   41401   43549
C TTAA G         : 43806   46037   53312   62042
                 : 65700   72180   77101   81978
                 : 86301   91655   93891   94983
                 : 95739   96841   97167   99214
                 : 114696  114949  115133  115232
                 : 120578  122208  126085  127496
                 : 128732  129314  130523  130710
                 : 131286  134360  150100  162281
                 : 167783  169521  169653  170292
                 : 170998

Number of fragments 4 kb or less: 34








Hind III    64   : 1       321     4834    5918
                 : 7959    14843   16895   18994
A AGCT T         : 32159   33703   38308   41512
T TCGA A         : 44158   44521   44717   46402
                 : 48209   48692   52752   55612
                 : 57379   57727   65779   70218
                 : 70601   71947   73380   75933
                 : 77773   78860   80726   94474
                 : 94886   102267  102578  112246
                 : 113833  120486  121556  121647
                 : 124186  124409  124818  126795
                 : 134126  136011  137970  140077
                 : 141184  143075  145328  146005
                 : 146673  148906  150711  150993
                 : 151617  157093  160311  162518
                 : 166369  166672  169514  171900

Number of fragments 4 kb or less: 52



FIG. 26(cont.) 
AvrII, NheI, and SpeI site frequencies in a sequenced BAC from 7q31.

RG253B13, 7q31 Met Oncogene

25 AvrII, 22 NheI and 21 SpeI Sites in 171,905 bp

[Restriction map: AvrII (25 sites), NheI (22 sites), and SpeI (21 sites) along the BAC; scale 25,000 to 150,000 bp.]



Enzyme      Freq   Position(s)

Avr II      25   : 7350    7990    11781   41276
                 : 56073   56739   71378   80285
C CTAG G         : 80378   80418   81455   92044
G GATC C         : 95088   106812  132860  133491
                 : 138089  138866  138891  138919
                 : 158473  159109  163153  163762
                 : 168991

Number of fragments 4 kb or less: 14 (Clustering)




Avrll 


Location* 1 


Location#2 + 2 bases 


Complement + 2 bases 


1. 


7,350 


7,990 






AA" 




2. 


7,990 


11,781 


cc® 




CT'' 




3. 


56,073 


56,739 


CA" 




TG® 




4. 


80.285 


80,378 


TT" 




AC* 




5. 


80,378 


80,418 


CA 




CA (40 bp fragment) 


6. 


80.418 


81.455 


AC* 




AA* 




7. 


92,044 


95.088 


GG® 




TC® 




8. 


132.860 


133.491 






AA" 




9. 


138.089 


138,866 


CT'' 




TT" 




10. 


138,866 


138.891 


TG 




TG (25 bp fragment) 


11. 


138.891 


138.919 


CT 




AG (28 bp fragment) 


12. 


158,473 


159, 109 


AA" 




TT^ 




13. 


159, 109 


163,153 


CA* 




TA® 




14. 


163,153 


163,762 


AA" 









Clusters: (4, 5. 6); (9, 10, 11); (13, 14) 

AvrU 

®Same + 2 bases next to site within BAC used exactly once (singlet). 5 
*Same + 2 bases next to site within BAC used exactly twice (doublet). 2 
^Same + 2 bases next to site within BAC used more than twice. 3 



FIG. 27 
[Restriction map (continued): AvrII (25 sites), NheI (22 sites), and SpeI (21 sites); scale 25,000 to 150,000 bp.]



Nhe I       22   : 7114    10879   22730   29080
                 : 38661   51766   58900   62751
G CTAG C         : 64798   68351   71494   73609
C GATC G         : 82697   91479   106192  132980
                 : 134667  134793  137390  158989
                 : 161975  167497

Number of fragments 4 kb or less: 10 (Clustering)



Nhel 


Location#l 


Location#2 


+ 2 bases 


1. 


7,114 


10,879 


IT 


2. 


58,900 


62,751 


TG* 


3. 


62,751 


64,798 


AC'' 


4. 


64,798 


68,351 


TC" 


5. 


68,351 


71,494 


AC" 


6. 


71,494 


73,609 


TA® 


7. 


132,980 


134,667 


CA'' 


8. 


134,667 


134,793 


GG® 


9. 


134,793 


137,390 


TT» 


10. 


158,989 


161,975 


CA" 



Complement + 2 bases 

TC" 

CA" 

AC" 

TC" 

TG' 

CA" 

AA® 

AG* 

AC" 

AG" 



Clusters: (3, 4, 5, 6); (7, 8, 9) 

®Same + 2 bases next to site within BAG used exactly once (singlet). 
'Same + 2 bases next to site within BAC used exactly twice (doublet). 
"Same + 2 bases next to site within BAC used more than twice. 



Nhel 
3 
3 
3 



FIG. 27(cont.) 
[Restriction map (continued): AvrII (25 sites), NheI (22 sites), and SpeI (21 sites); scale 25,000 to 150,000 bp.]





Spe I 


21 : 


3173 


7256 


29438 


50198 


i 






54057 


63422 


64771 


68328 


A CTAG T 




72447 


76712 


88296 


104546 


T GATC A 




121378 


124275 


132360 


139059 




T 




139107 


148566 


150563 


159612 








169084 








Number 


of fragments 


4 kb or 


less: 9 


(Clustering) 




Spel 


Location#l 


Location#2 + 2 bases 


Complement + 2 bases 


1. 


3,173 


7,256 


TC 




GA^ 


2. 


50,198 


54,057 


TG' 




GG^ 




3. 


63,422 


64,777 


GA^ 




GG"" 




4. 


64,777 


68,328 


CA® 




GG'' 




5. 


68,328 


72,447 










6. 


72,447 


76,712 


GT® 




GC® 




7. 


121,378 


124.275 


GA^ 




TC* 




8. 


139.059 


139,107 


AT 




AC (48 bp fragment) 


9. 


148,566 


150.563 


TG* 




TT^ 





Clusters: (3, 4, 5. 6) 

®Same + 2 bases next to site within BAC used exactly once (singlet). 3 

"Same + 2 bases next to site within BAC used exactly twice (doublet). 3 

^Same + 2 bases next to site within BAC used more than twice. 3 



FIG. 27(cont) 
Sequencing BsiHKAI islands in random BAC clones



1. Cut BAC DNA with BsiHKAI in the presence of linkers and T4 ligase.
The linker for the BsiHKAI site is phosphorylated and contains a 3' AGCA
overhang. Biochemical selection assures that most sites contain linkers.

2. Inactivate T4 ligase and BsiHKAI endonuclease at 95°C for 5 min. PCR
amplify using primers containing ribose U replacing dT, dNTPs, and
Taq polymerase. Add 0.1 N NaOH and heat to 95°C for 5 min to destroy
unused primers.

3. Neutralize and dilute. Anneal sequencing primer which extends past the
BsiHKAI site linker by two bases and perform a cycle-sequencing reaction.
(Separate reactions are performed using primers containing other two base
extensions).



5*- 
3" 



BjiHKAI 
I 

■ GWGCWC- 
- CWCGWG" 

r 



fiWHKAl 
4 

■ GWGCWC - 

■ CWCGWG - 

1 



1 



5' 

5' . 
3' ' 



— AGCA 

. AAGCAC • 
•TTCGTG- 



■ GTGCTT- 
• CACGAA- 

•acga - 



T T TT 



. AAGCAC • 
• TTCGTG ' 



GTGCTT- 
CACGAA- 



▲A ▲ ▲ 



• AA 



AAGCAC ' 
• TTCGTG • 



GTGCTT- 
CACGAA 



FIG. 28 
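
The BsiHKAI recognition sequence GWGCWC contains the degenerate base W (A or T), so locating sites in a sequence needs pattern matching, not exact string search. A minimal sketch using the standard IUPAC nucleotide codes follows. The function name `find_sites` and the test sequence are hypothetical, and positions are reported as the 1-based start of the recognition sequence, which may differ from the cut-position convention used in the figures.

```python
import re
from typing import List

# Standard IUPAC nucleotide codes mapped to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "W": "[AT]", "S": "[CG]", "M": "[AC]", "K": "[GT]",
    "R": "[AG]", "Y": "[CT]", "N": "[ACGT]",
}


def find_sites(sequence: str, recognition: str) -> List[int]:
    """Return 1-based start positions of every match to a degenerate site."""
    pattern = "".join(IUPAC[base] for base in recognition.upper())
    # A lookahead reports overlapping occurrences as well.
    return [m.start() + 1 for m in re.finditer(f"(?={pattern})", sequence.upper())]


print(find_sites("AAGTGCTCAAGAGCACTT", "GWGCWC"))  # [3, 11]
```

The same function handles the other degenerate sites shown in these figures, for example GTMKAC for AccI or GGGWCCC for SanDI.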
AccI and BsiHKAI site frequencies in a sequenced BAC from 7q31.

RG253B13, 7q31 Met Oncogene

71 AccI and 127 BsiHKAI Sites in 171,905 bp

[Restriction map: AccI (71 sites) and BsiHKAI (127 sites) along the BAC; scale 25,000 to 150,000 bp.]





Enzyme 
Acc I 



Freq 
71 



Position (s) 



GT MK AC 
CA KM TG 
T 



523 


5182 


6465 


9711 


12950 


13976 


15332 


16332 


19814 


21540 


22269 


22322 


26959 


28705 


32048 


32661 


33298 


33310 


34799 


35425 


42895 


44110 


46004 


47636 


47861 


52446 


54000 


58216 


58826 


65238 


66475 


69750 


71833 


72783 


74938 


75538 


77087 


77368 


77642 


80744 


82917 


87470 


91592 


96498 


98545 


100882 


100965 


104551 


104725 


105186 


109580 


110415 


112720 


114135 


114242 


120913 


127597 


131831 


137724 


139036 


141043 


142923 


142963 


145284 


149681 


155647 


157032 


160140 


165449 


167062 


167292 





AG#l+2 AG#2 + 2 

TT AV 
(10 bp fragment) 
(Too long) 

TT AA® 

CC® AT® 

Ar TG® 

AccI

®Same + 2 bases next to site within BAC used exactly once (singlet). 4
*Same + 2 bases next to site within BAC used exactly twice (doublet). 2
^Same + 2 bases next to site within BAC used more than twice. 0



Accl# 


Location#l 


Location#2 


1. 


13,976 


15,332 


2. 


33,298 


33,310 


3. 


35.425 


42,895 


4. 


69,750 


71,833 


5. 


96,498 


98,545 


6. 


109,580 


110,415 



FIG. 29 
[Restriction map (continued): AccI (71 sites) and BsiHKAI (127 sites); scale 25,000 to 150,000 bp.]



Enzyme 
BsiHKA I 

i 

G WGCW C 
C WCGW G 

T 



Freq 
127 



Position (s) 



1200 


1274 


3588 


4 OlU 


6151 


9251 


9358 


I Uo7 1 


11182 


1204 6 


23820 




26538 


29546 


31865 


J34 17 


3362C 


33828 


34406 


34818 


35750 


39076 


39888 


40291 


41356 


41605 


4 1622 


41723 


42439 


43101 


43155 


43959 


44003 


44572 


46346 


47692 


48495 


48608 


49119 


51943 


52138 


52540 


53172 


53348 


54384 


56608 


61639 


61987 


68891 


69195 


70155 


73864 


74122 


75448 


76167 


77810 


78326 


78825 


81275 


81950 


82251 


82594 


87958 


89375 


90017 


91434 


91584 


93846 


94001 


96276 


97766 


97942 


102220 


104114 


105012 


106105 


107321 


108501 


111466 


112396 


113542 


114132 


115157 


116106 


118786 


120094 


122269 


122357 


122376 


122400 


125590 


128460 


130102 


130144 


130366 


131806 


135930 


137267 


137611 


139881 


141326 


141747 


143572 


143995 


144453 


144701 


147329 


148398 


150702 


150741 


151888 


153643 


154630 


155122 


156946 


157058 


160171 


160400 


164987 


167605 


167618 


167660 


167683 


168011 


168643 


168776 


171471 





B5jHKAILocation#l Location#2 AGCA#l+2 AGCA#2 + 2 



1. 


3,588 


4,610 


AC 


TT* 
TG' 


2. 


23,820 


26,072 


AA« 


3. 


43,959 


44.003 


TT 


AA (44 bp fragment) 


4. 


48,608 


49,119 


AG® 


GA* 


5. 


52,138 


52,540 


CT« 


GG® 


6. 


76,167 


77,810 


AC* 


TT* 


7. 


102.220 


104,114 


CC 


CC (24 bp fragment) 
TG" 


8. 


155.122 


156,946 


AT** 



BsiHKAI

®Same + 2 bases next to site within BAC used exactly once (singlet). 6
*Same + 2 bases next to site within BAC used exactly twice (doublet). 3
^Same + 2 bases next to site within BAC used more than twice. 0

FIG. 29(cont) 
Sequencing SanDI islands in random BAC clones



1. Cut BAC DNA with MspI and SanDI in the presence of linkers and T4
ligase. The linker for the SanDI site is phosphorylated and contains a
5' GTC overhang. The linker for the MspI site is not phosphorylated,
and contains a bubble. Biochemical selection assures that most sites
contain linkers.



SanDl 
i 

- GGGWCCC - 
-CCCWGGG- 
1 



• CCGG 

• GGCC 

t 



5'< 

5' 

3" 



.AUGAC 

•ATGACCC- 
•TACTGGG- 




2. Inactivate T4 ligase and restriction endonucleases at 95°C for 5 min.
PCR amplify using primers containing ribose U replacing dT, dNTPs, and
Taq polymerase. Add 0.1 N NaOH and heat to 95°C for 5 min to destroy
unused primers.



5' < 
3'" 



T T 



JLaugaccc- 

TACTGGG- 



CCGT - 
■GGCA- 



3. Neutralize and dilute. Anneal sequencing primer which extends past the
SanDI site linker by two bases and perform a cycle-sequencing reaction.
(Separate reactions are performed using primers containing other two base
extensions).



FIG. 30 
SanDI and SexAI site frequencies in a sequenced BAC from 7q31.

RG253B13, 7q31 Met Oncogene

13 SanDI and 15 SexAI Sites in 171,905 bp

[Restriction map: SanDI (13 sites) and SexAI (15 sites) along the BAC; scale 25,000 to 150,000 bp.]



Enzyme      Freq   Position(s)

SanD I      13   : 9761    10644   36269   40440
                 : 58583   66380   99267   119927
GG GWC CC        : 122060  128057  137082  140964
CC CWG GG        : 143225



SanDI #   Location   GAC + 2 bases

 1.         9,761    CT*
 2.        10,644    TC*
 3.        36,269    AC*
 4.        40,440    TC*
 5.        58,583    TG*
 6.        66,380    CA®
 7.        99,267    TG*
 8.       119,927    AT®
 9.       122,060    CG®
10.       128,057    TA*
11.       137,082    AC*
12.       140,964    CT*
13.       143,225    TA*

SanDI

®Same + 2 bases next to site within BAC used exactly once (singlet). 3
*Same + 2 bases next to site within BAC used exactly twice (doublet). 5
^Same + 2 bases next to site within BAC used more than twice. 0
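
The singlet and doublet tallies for SanDI can be reproduced by counting how often each two-base extension occurs among the 13 sites. The sketch below is illustrative only. The name `classify_extensions` is hypothetical, and the extension for site 12 is read as CT, which is an assumption because the scan is unclear at that entry.

```python
from collections import Counter
from typing import Dict, List


def classify_extensions(extensions: List[str]) -> Dict[str, List[str]]:
    """Group two-base extensions by how many sites in the BAC share them."""
    groups: Dict[str, List[str]] = {"singlet": [], "doublet": [], "more": []}
    for ext, n in Counter(extensions).items():
        key = "singlet" if n == 1 else "doublet" if n == 2 else "more"
        groups[key].append(ext)
    return groups


# GAC + 2 bases for SanDI sites 1 to 13 (site 12 assumed to be CT).
sandi = ["CT", "TC", "AC", "TC", "TG", "CA", "TG", "AT", "CG", "TA", "AC", "CT", "TA"]

groups = classify_extensions(sandi)
print({name: len(members) for name, members in groups.items()})
# {'singlet': 3, 'doublet': 5, 'more': 0}
```

The result matches the counts stated under the table: three extensions used once, five used twice, and none used more than twice.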



FIG. 31 






[Restriction map (continued): SanDI (13 sites) and SexAI (15 sites); scale 25,000 to 150,000 bp.]



Enzyme      Freq   Position(s)

SexA I      15   : 9499    10411   19691   47816
                 : 54773   58714   61533   62534
A CCWGG T        : 78279   98356   103356  114268
T GGWCC A        : 114440  142141  155393

SexAI #   Location   CCAGG + 2 bases

 1.         9,499    TG®
 2.        10,411    CT^
 3.        19,691    TT*
 4.        47,816    CC®
 5.        54,773    CT^
 6.        58,714    GG®
 7.        61,533    GC®
 8.        62,534    TC®
 9.        78,279    CT^
10.        98,356    TT*
11.       103,356    AT®
12.       114,268    AA®
13.       114,440    GA*
14.       142,141    CA®
15.       155,393    GA*

SexAI

®Same + 2 bases next to site within BAC used exactly once (singlet). 8
*Same + 2 bases next to site within BAC used exactly twice (doublet). 2
^Same + 2 bases next to site within BAC used more than twice. 1



FIG. 31 (cont.) 
AccI and BsiHKAI sites in the pBeloBAC11 cloning vector.



1000 2000 3000 4000 5000 6000 7000 . 
I I I 1_ I ! I I 



Accl 6 I 

BsiHKAl 8 I I 



Enzyme 
Acc I 

i 

GT MK AC 
CA KM TG 
T 



Freq 
6 



Position (s) 
367 647 



1832 1891 6262 7031 



AccW Location#l Location#2 AG#l+2 
None with head to head AG overhangs. 



AG#2 + 2 



BsiHKA I 

i 

G WGCW C 
C WCGW G 
T 



91 
7048 



343 
7458 



2352 3966 5458 7040 



fijF/HKAILocation#l Location#2 AGCA#l+2 AGCA#2 + 2 
None with head to head AGCA overhangs. 



FIG. 32 
AvrII, BamHI, NheI, and SpeI sites in the pBeloBAC11 cloning vector.



1000 2000 3000 4000 5000 6000 7000 
I I I I I I I L_ 



Spe I 1 



Avrll Location#l Location#2 + 2 bases Complement + 2 bases 

Non-cutting enzymes : 
Avr II Nhe I 

Nhel Location#l Location#2 + 2 bases Complement + 2 bases 

Non-cutting enzymes : 
Avr II Nhe I 



Enzyme Freq Position (s) 

Spe I 1 : 6090 

i : 

A CTAG T : 

T GATC A : 
T 



Spel Location#l Locat!on#2 +2 bases Complement + 2 bases 
No PCR vector fragment under 4 kb. 



1000 2000 3000 4000 5000 6000 7000 
_J I I I I 1 1— 



BamH I 1 



Enzyme Freq Position (s) 

BamH I 1 : 354 

I : 
G GATC C : 
C CTAG G : 
T : 

BamUl Location# 1 Location#2 + 2 bases Complement + 2 bases 
No PCR vector fragment under 4 kb. 



FIG. 33 
SanDI and SexAI sites in the pBeloBAC11 cloning vector.



1000 2000 3000 4000 5000 6000 7000 , 
I I I I -J I I 1 1 



SexAl 1 



5anDI#Location A + 2 bases 

Non-cutting enzymes : 
SanD I 



Enzyme Freq Position (s) 

SexA I 1 : 6968 

i 

A CCWGG T 
T GGWCC A 
T 



SejcAI#Location CCAGG + 2 bases 
1. 6,968 AT 



FIG. 34 
[Restriction map of BAC RG253B13: BglI (16 sites), DrdI (12 sites), MspI (86 sites), SapI (25 sites), and TaqI (62 sites).]


Sequence: BAC RG253B13.seq (1 > 171905)

201 Cut Sites



822 


Msp I 


45534 


Drd I 


112720 


Taq I 


1205 


Sap I 


46540 


Taq I 


112849 


Sap I 










112925 


Msp I 


1455 


Sap I 


51614 


Msp I 






1 T it /I 

1/44 


Taq I 


51629 


Bgl I 


112992 


Taq I 


5385 


Drd I 


51633 


Msp I 


113048 


Drd I 


6249 


Taq 1 






113429 


Taq I 






54087 


Msp I 




10904 


Taq I 


56037 


Sap I 


127710 


Taq I 


10942 


Sap I 


57978 


Msp I 


129198 


Bgl I 






58314 


Bgl I 


129286 


Taq I 


10962 


Sap I 


58886 


Taq I 














134499 


Taq I 


11040 


Sap I 


64973 


Msp I 


135480 


Sap I 




Taq I 


68849 


Sap 1 


135615 


Sap I 






70528 


Drd I 


135890 


Msp I 


12529 


Taq I 


71393 


Msp I 


136246 


Sap I 


1 Jboy 


Bgl I 






136580 


Taq I 


1 >i (i "3 "3 


Msp I 


76855 


Taq I 


137177 


Drd I 








Sap I 


137473 


Msp I 




wsp i 


7704 1 


Msp I 




25121 


Bal I 






1 O / ' Z 0 


wsp 1 


25228 


Msp I 


88256 


Msp I 


137753 


Bgl I 






88322 


Bgl I 


137987 


Msp I 


26363 


Msp I 


91681 


Drd I 




26871 


Drd I 


92596 


Taq I 


140465 


Taq I 


31038 


Sap I 


94140 


Bgl I 


142242 


Sap I 


31368 


Taq I 


95752 


Msp I 


142402 


Taq I 






96506 


Drd I 




31440 


Taq I 


97059 


Taq I 


147961 


Taq I 


32606 


Sap I 






148482 


Sap I 


33306 


Drd I 


99370 


Taq I 


149252 


Bgl I 


33896 


Bgl I 


99.469 


Bgl I 


150256 


Taq I 


37052 


Sap I 


99513 


Msp I 




38218 


Msp I 


99628 


Drd I 


156469 


Msp I 


38938 


Sap I 


100051 


Bgl I 


156583 


Bgl I 


39316 


Msp I 


100257 


Sap I 


157032 


Taq I 


39876 


Sap I 


101248 


Taq I 


157977 


Sap I 


40280 


Msp I 


101440 


Drd I 


158054 


Msp I 






103234 


Taq I 


159685 


Drd I 


42440 


Msp I 






160038 


Msp I 


44332 


Sap I 


105244 


Msp I 




44477 


Msp I 


106619 


Bgl I 


160434 


Msp I 






109494 


Taq I 


160832 


Sap I 


45343 


Taq I 






161212 


Taq I 



FIG. 35 
PCT/US00/00144



161237 


Msp 


I 


161467 


Bgl 


I 


162462 


Taq 


I 


165127 


Taq 


I 


165703 


Bgl 


I 


165714 


Msp 


I 


166152 


Sap 


I 


166163 


Msp 


I 


168336 


Taq 


I 


171459 


Sap 


I 



FIG. 35(cont) 
DrdI site: For AA, AC, AG, CA, GA, and GG overhangs

DrdI#  Location  Overhang  Complement  Nearest MspI or TaqI  Fragment Length
1.     5,379     GG*       CC          6,249                 864
2.     26,865    GT        AC*         26,363                502
3.     33,300    GG*       CC          38,218                4,918#
4.     45,528    AT        AT
5.     70,522    AT        AT
6.     91,675    TC        GA*         88,256                3,419
7.     96,500    CA*       TG          97,059                559
8.     99,622    CT        AG*         99,513                115
9.     101,434   TT        AA*         101,248               192
10.    113,042   AC*       GT          113,429               381
11.    137,171   TT        AA*         136,580               597
12.    159,679   AG*       CT          160,038               353



* To obtain sequence information on AA, AC, AG, CA, GA, or GG overhangs in the sense
direction, the DrdI island is amplified using a downstream MspI or TaqI site. For such two base
sequences on the complementary strand, the DrdI island is amplified using an upstream MspI
or TaqI site.

Same last 2 bases of 3' overhang within BAC used exactly once (singlet). 3 
Same last 2 bases of 3' overhang within BAC used exactly twice (doublet). 3 
Same last 2 bases of 3' overhang within BAC used more than twice. 0 
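The three tallies above can be reproduced from the starred overhangs in the table. The following is a minimal sketch, assuming that fragments flagged with # are left out of the count (an inference that happens to reproduce the printed 3 / 3 / 0, not a rule stated in the figure). The names `ROWS` and `tally_overhangs` are illustrative only.

```python
from collections import Counter

# (starred overhang, flagged with '#') for rows 1-12 of the
# AA/AC/AG/CA/GA/GG table. Rows 4 and 5 carry palindromic AT
# overhangs with no starred entry, so they are recorded as None.
ROWS = [
    ("GG", False), ("AC", False), ("GG", True), None, None, ("GA", False),
    ("CA", False), ("AG", False), ("AA", False), ("AC", False),
    ("AA", False), ("AG", False),
]

def tally_overhangs(rows):
    """Bin overhangs by how many usable islands in the BAC share them."""
    usable = Counter()
    for row in rows:
        if row is None:
            continue
        overhang, flagged = row
        if not flagged:  # assumption: '#' fragments are excluded
            usable[overhang] += 1
    singlets = sorted(o for o, n in usable.items() if n == 1)
    doublets = sorted(o for o, n in usable.items() if n == 2)
    multiplets = sorted(o for o, n in usable.items() if n > 2)
    return singlets, doublets, multiplets

singlets, doublets, multiplets = tally_overhangs(ROWS)
print(len(singlets), len(doublets), len(multiplets))
```

Under that assumption GG counts as a singlet, because one of its two islands (row 3) is flagged as unusable.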



DrdI site: For TT, GT, CT, TG, TC, and CC overhangs

DrdI#  Location  Overhang  Complement  Nearest MspI or TaqI  Fragment Length
1.     5,379     GG        CC*         1,744                 3,635
2.     26,865    GT*       AC          31,368                4,503#
3.     33,300    GG        CC*         31,440                1,860
4.     45,528    AT        AT
5.     70,522    AT        AT
6.     91,675    TC*       GA          92,596                921
7.     96,500    CA        TG*         95,752                748
8.     99,622    CT*       AG          101,248               1,626
9.     101,434   TT*       AA          103,234               1,800
10.    113,042   AC        GT*         112,992               50#
11.    137,171   TT*       AA          137,473               302
12.    159,679   AG        CT*         158,054               1,625



* To obtain sequence information on TT, GT, CT, TG, TC, or CC overhangs in the sense
direction, the DrdI island is amplified using a downstream MspI or TaqI site. For such two base
sequences on the complementary strand, the DrdI island is amplified using an upstream MspI
or TaqI site.

Same last 2 bases of 3' overhang within BAC used exactly once (singlet). 2
Same last 2 bases of 3' overhang within BAC used exactly twice (doublet). 3
Same last 2 bases of 3' overhang within BAC used more than twice. 0

# Fragment too small to give interpretable sequence (>80), or too large to amplify properly. 



FIG. 35(cont) 
BglI site: For AAN, CAN, GAN, TAN, AGN, CGN, GGN, and TGN overhangs

BglI#  Location  Overhang  Complement  Nearest MspI or TaqI  Fragment Length
1.     13,833    TGT*      ACA         14,933                1,100
2.     25,115    ACA       TGT*        22,165                2,950
3.     33,890    GAA*      TTC         37,052                3,162
4.     51,623    TGT*      ACA         51,633                10#
5.     58,308    CTA       TAG*        57,978                330
6.     88,316    TTA       TAA*        88,256                60#
7.     94,134    GGG*      CCC         95,752                1,618
8.     99,463    ACA       TGT*        99,370                93
9.     100,045   ACC       GGT*        99,628                417
10.    106,613   CCA       TGG*        105,244               1,369
11.    129,192   TGT*      ACA         129,286               94
12.    137,747   TCT       AGA*        137,728               19#
13.    149,246   TGT*      ACA         150,256               110
14.    156,577   TTT       AAA*        156,469               108
15.    161,461   CGA*      TCG         162,462               101
16.    165,697   CTG       CAG*        165,127               570



* To obtain sequence information on AAN, CAN, GAN, TAN, AGN, CGN, GGN, or TGN
overhangs in the sense direction, the BglI island is amplified using a downstream MspI or TaqI site.
For such three base sequences on the complementary strand, the BglI island is amplified using an
upstream MspI or TaqI site.

Same last 2 bases of 3' overhang within BAC used exactly once (singlet). 5
Same last 2 bases of 3' overhang within BAC used exactly twice (doublet). 2
Same last 2 bases of 3' overhang within BAC used more than twice. 1
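In the DrdI and BglI tables of this figure, each "Complement" entry is the reverse complement of the "Overhang" entry read 5' to 3' (for example TGT and ACA, CTA and TAG). A minimal sketch of that relationship follows; the helper name `reverse_complement` is illustrative and not taken from the source.

```python
# Translation table for the four bases plus the degenerate symbol N.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(overhang):
    """Return the reverse complement of a DNA string, read 5' to 3'."""
    return overhang.upper().translate(_COMPLEMENT)[::-1]

# Overhang/complement pairs as printed in the tables.
for overhang in ("TGT", "GAA", "CTA", "TTA", "CCA", "GT", "TC"):
    print(overhang, reverse_complement(overhang))
```

Palindromic overhangs such as AT map to themselves, which is consistent with the two AT rows of the DrdI tables carrying no starred strand.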

BglI site: For ACN, CCN, GCN, TCN, ATN, CTN, GTN, and TTN overhangs

BglI#  Location  Overhang  Complement  Nearest MspI or TaqI  Fragment Length
1.     13,833    TGT       ACA*        12,529                1,304
2.     25,115    ACA*      TGT         25,228                113
3.     33,890    GAA       TTC*        33,306                584
4.     51,623    TGT       ACA*        51,614                9#
5.     58,308    CTA*      TAG         58,886                578
6.     88,316    TTA*      TAA         91,681                3,365
7.     94,134    GGG       CCC*        92,596                1,538
8.     99,463    ACA*      TGT         99,513                50#
9.     100,045   ACC*      GGT         100,257               212
10.    106,613   CCA*      TGG         109,494               2,881
11.    129,192   TGT       ACA*        127,710               1,482
12.    137,747   TCT*      AGA         137,987               240
13.    149,246   TGT       ACA*        148,482               764
14.    156,577   TTT*      AAA         157,032               455
15.    161,461   CGA       TCG*        161,237               224
16.    165,697   CTG*      CAG         165,714               17#



FIG. 35(cont.) 
* To obtain sequence information on ACN, CCN, GCN, TCN, ATN, CTN, GTN, or TTN
overhangs in the sense direction, the BglI island is amplified using a downstream MspI or TaqI site.
For such three base sequences on the complementary strand, the BglI island is amplified using an
upstream MspI or TaqI site.

Same last 2 bases of 3' overhang within BAC used exactly once (singlet). 0
Same last 2 bases of 3' overhang within BAC used exactly twice (doublet). 3
Same last 2 bases of 3' overhang within BAC used more than twice. 2



# Fragment too small to give interpretable sequence (>80), or too large to amplify properly. 



Or, alternatively, mix and match the above to include trinucleotides where the middle base of the
upper strand is either A or C, corresponding to the 3' end of the PCR primer.



BglI site: For AAN, CAN, GAN, TAN, ACN, CCN, GCN, and TCN overhangs

BglI#  Location  Overhang  Complement  Nearest MspI or TaqI  Fragment Length
1.     13,833    TGT       ACA*        12,529                1,304
2.     25,115    ACA*      TGT         25,228                113
3.     33,890    GAA*      TTC         37,052                3,162
4.     51,623    TGT       ACA*        51,614                9#
5.     58,308    CTA       TAG*        57,978                330
6.     88,316    TTA       TAA*        88,256                60#
7.     94,134    GGG       CCC*        92,596                1,538
8.     99,463    ACA*      TGT         99,513                50#
9.     100,045   ACC*      GGT         100,257               212
10.    106,613   CCA*      TGG         109,494               2,881
11.    129,192   TGT       ACA*        127,710               1,482
12.    137,747   TCT*      AGA         137,987               240
13.    149,246   TGT       ACA*        148,482               764
14.    156,577   TTT       AAA*        156,469               108
15.    161,461   CGA       TCG*        161,237               224
16.    165,697   CTG       CAG*        165,127               570



* To obtain sequence information on AAN, CAN, GAN, TAN, ACN, CCN, GCN, or TCN
overhangs in the sense direction, the BglI island is amplified using a downstream MspI or TaqI site.
For such three base sequences on the complementary strand, the BglI island is amplified using an
upstream MspI or TaqI site.

Same last 2 bases of 3' overhang within BAC used exactly once (singlet). 3
Same last 2 bases of 3' overhang within BAC used exactly twice (doublet). 3
Same last 2 bases of 3' overhang within BAC used more than twice. 1

# Fragment too small to give interpretable sequence (>80), or too large to amplify properly. 
For AA, AC, AG, AT, GA, GC, GG and GT overhangs

SapI#  Location  SapI Overhang  Ligated Complement  Nearest MspI or TaqI  Fragment Length
1.     1,198     CTA            TAG* down           No
2.     1,456     AGG            CCT up              No
3.     10,943    GCT            AGC* up             10,904                39#
4.     10,955    GCT            ACG down            No
5.     11,041    CAA            TTG up              No
6.     31,031    AAT            ATT down            31,368
7.     32,599    GAT            ATC down            No
8.     37,053    AGA            TCT up              No
9.     38,931    GGG            CCC down            39,316
10.    39,877    ATC            GAT* up             39,316                571
11.    44,325    CTT            AAG* down           44,477                152
12.    56,040    ACA            TGT* down           57,978                1,938
13.    68,850    ACC            GGT* up             64,973                3,877
14.    76,930    GTG            CAC* up             76,855                75#
15.    100,250   GGG            CCC down            101,248
16.    112,850   GAT            ATC up              112,720
17.    135,473   ACA            TGT* down           No
18.    135,608   GGA            TCC down            135,890
19.    136,239   TTG            CAA* up             135,890               349
20.    142,243   GCC            GGC* up             140,465               1,778
21.    148,475   GCG            CGC* down           150,256               1,781
22.    157,978   TCT            AGA* up             157,032               946
23.    160,833   ACC            GGT* up             160,434               399
24.    166,153   ATT            AAT* up             165,714               439
25.    171,460   GTT            AAC* up             168,336               3,124



* To obtain sequence information on AA, AC, AG, AT, GA, GC, GG or GT overhangs in the sense
direction, the SapI island is amplified using a downstream MspI or TaqI site. For such two base
sequences on the complementary strand, the SapI island is amplified using an upstream MspI or
TaqI site.

Same last 2 bases of 3' overhang within BAC used exactly once (singlet). 3
Same last 2 bases of 3' overhang within BAC used exactly twice (doublet). 3
Same last 2 bases of 3' overhang within BAC used more than twice. 1

# Fragment too small to give interpretable sequence (>80), or too large to amplify properly. 



FIG. 35(cont) 
Three degrees of specificity in amplifying a BglI representation.



1. Ligation of the top strand requires perfect complementarity at the penultimate
base to the 3' side of the junction (20-fold specificity).

2. Ligation of the bottom strand requires perfect complementarity at the 3' side
of the junction (50-fold specificity).

3. Extension of polymerase off the sequencing primer is most efficient if the 3'
base is perfectly matched (10 to 100-fold specificity).



5' 



AC 

CTAAACNNGGC 
GATTTGNNCCG 



5' 
3*- 



FIG. 36 
Scheme 1 for sequencing DrdI and BglI generated representations

Individual clone from BAC library

1. Pick individual colony into lysis buffer. Partially purify BAC DNA from chromosomal DNA.

2. Cut with restriction endonucleases DrdI, BglI, MspI, and TaqI in the presence of linkers and T4 ligase. For DrdI and BglI sites, add multiple divergent linkers with non-palindromic overhangs.

3. PCR amplify to generate sufficient DNA template for cycle-sequencing.

4. Aliquot into multiple wells. If needed, perform a secondary PCR amplification using primers which are complementary to the particular linker sequences. Perform individual cycle-sequencing reactions.



FIG. 37 
Overlapping DrdI islands in four hypothetical BAC clones: 1 AA overhangs



la Ic 

i L 



lb 



Ic 



id 



III 



lb 



Ic 



Ic 



IV 





BAC Clone #  1 = AA                Concordance  1 = AA                  Discordance  1 = AA
I            Triplet (1a, 1b, 1c)  I & III      Triplet & Doublet (1b)  I & II       1a, b, c ≠ 1d, e
II           Doublet (1d, 1e)      II & III     Doublet & Doublet (1e)  I & IV       1a, b, c ≠ 1e
III          Doublet (1b, 1e)      III & IV     Doublet & Singlet (1e)
IV           Singlet (1e)          II & IV      Doublet & Singlet (1e)






Order of DrdI islands in four BAC clones.








la, Ic 

1 




lb 

1 


le 




Id 



m 



IV 



II 



FIG. 38 
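The concordance and discordance columns of FIG. 38 amount to asking, for each pair of clones, which DrdI islands they share. A minimal sketch follows, using the island assignments listed in the figure for the AA overhang; the names `ISLANDS_AA` and `pairwise_overlap` are illustrative and not part of the source.

```python
from itertools import combinations

# AA-overhang islands read for each hypothetical BAC clone (FIG. 38).
ISLANDS_AA = {
    "I": {"1a", "1b", "1c"},
    "II": {"1d", "1e"},
    "III": {"1b", "1e"},
    "IV": {"1e"},
}

def pairwise_overlap(islands):
    """Map each clone pair to the set of islands the two clones share."""
    return {
        (a, b): islands[a] & islands[b]
        for a, b in combinations(islands, 2)
    }

for pair, shared in pairwise_overlap(ISLANDS_AA).items():
    status = "concordant" if shared else "discordant"
    print(pair, status, sorted(shared))
```

A non-empty intersection corresponds to a concordance entry (clones I and III share 1b), and an empty one to a discordance entry (clones I and II share nothing).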
Overlapping DrdI islands in four hypothetical BAC clones: 2 AC overhangs



2b 



2c 



2d 



ill 



2e 



2c 



IV 



2c 



2d 



BAC Clone # 


2 = AC 


Concordance 


2 = AC 


Discordance 


2 = AC 


I 


Doublet 
2a, 2b 


i&ni 


No overlap 


I&II 


2a, b 2c, d 


II 


Doublet 
2c, 2d 


ii&ra 


Doublet & 
Doublet (2c) 


I&IV 


2a,b;t2c,d 


lU 


Doublet 
2c, 2e 


III & IV 


Doublet & 
Doublet (2c) 






IV 


Doublet 
2c, 2d 


n&iv 


Doublet & 
Doublet (2c, d) 







Order of Drdl islands in four BAC clones. 

2a, 2b 



2e 



2c 



2d 



I 



lU 



IV 



II 



FIG. 39 
Overlapping DrdI islands in four hypothetical BAC clones: 3 AG overhangs



3a 



I 



3b 



3a 



III 



3b 



IV 



Order of DrdI islands in four BAC clones.

3a 

L_ 

I 



3b 



"TT 



IV 



u 



BAC Clone # 3 = AG 


Concordance 3 = AG 


Discordance 3 = AG 


I Singlet 
3a 

n Doublet 
3b, 3c 

III Singlet 

3a 

IV Singlet 

3b 


I & m Singlet & 

Singlet (3a) 

II & III No overlap 

III & IV No overlap 

II & IV Doublet & 
Singlet (3b) 


I&II 3a^3b,c 
I&IV 3a^3b 



3c 



FIG. 40 
Overlapping DrdI islands in four hypothetical BAC clones: 4 CA overhangs



III 



4a 



4a 



IV 



4c 



4c 



4c 



4b 



4b 



BAC Clone # 


4 = CA 


Concordance 


4 = CA 


Discordance 


4 = CA 


I 


Singlet 
4a 


l&III 


Singlet & 
Doublet (4a) 


I&II 


4a ^ 4b, c 


n 


Doublet 
4b, 4c 


11 & HI 


Doublet & 
Doublet (4c) 


I&IV 


4a ^ 4b, c 


m 


Doublet 
4a, 4c 


III & IV 


Doublet & 
Doublet (4c) 






IV 


Doublet 
4b, 4c 


II & IV 


Doublet & 
Doublet (4b, c) 







Order of Drdl islands in four BAC clones. 

4a 



4c 



4b 



I 



III 



IT 



FIG. 41 
Overlapping DrdI islands in four hypothetical BAC clones: 5 GA overhangs

5c 5b 5a 



II 



III 

Sd 

IV 



BAG Clone # 


5 = GA 


Concordance 


5 = GA 


Discordance 


5 = GA 


I 


Triplet 
5a, 5b, 5c 


I&III 


No overlap 


I&II 


5a, b, c ^ 5d, e 


II 


Doublet 
5d,5e 


U£clll 


No overlap 


I&IV 


5a, b,C5t5d 


III 


No sequence 


III & IV 


No overlap 






IV 


Singlet 
5d 


Il&IV 


Doublet & 
Singlet (5d) 







Order of Drdl islands in four BAC clones. 

5a, 5b. 5c 



I 

III 

I 

L 



FIG. 42 



5d 



5e 



IV 



II 
Overlapping DrdI islands in four hypothetical BAC clones: 6 GG overhangs



63 



6b 



III 



6a 



\ 6b 

JL 



6d 



IV 



BAG Clone # 


6 = CO 


Concordance 


6 = GG 


Discordance 


6 = GG 


I 


Ctoublet 
6a, 6b 


I&IIl 


Indeterminant 


I&II 




II 


No sequence 


II & III 


No overlap 


I&IV 


6a, b * 6c 


III 


Multiple! 
(6a, 6b, 6c, 6d) 


III & IV 


Indeterminant 






IV 


Singlet 
6c 


II & IV 


No overlap 







Order of Drd\ islands in four BAG clones. 

6a, 6b? 6a. 6b? 



6c? 



I 



ffl 



IV 



n 



FIG. 43 
Overlapping DrdI islands in four hypothetical BAC clones



5c 5b 2b 5a 6a 



I 



2a la \/ Ic 



3a \ 6b lb 4a 



]JL11 



III 



6a 



3a \ 6b lb 

jLl 



4a 



6d 



2c 



IV 



4c 



4c 



6c 2c 



4c 



6c 2c 



5d 4b 5c 



2c Ic 3b 2d I Id 



5d 4b 



le 3b 2d/ 



3c 



BAC Clone # 


1 = AA 


2 = AC 


3 = AG 


4 = CA 


5 = GA 


6 = GG 


I 


Triplet 
la, lb, Ic 


Doublet 
2a, 2b 


Singlet 
3a 


Singlet 
4a 


Triplet 
5a, 5b, 5c 


Doublet 
6a, 6b 


11 


Doublet 
Id, le 


Doublet 
2c, 2d 


Doublet 
3b, 3c 


Doublet 
4b, 4c 


Doublet 
5d,5e 


No sequence 


III 


Doublet 
lb, le 


Doublet 
2c, 2e 


Singlet 
3a 


Doublet 
4a, 4c 


No sequence 


Multiplet 
(6a, 6b, 6c, 6d) 


IV 


Singlet 
le 


Doublet 
2c, 2d 


Singlet 
3b 


Doublet 
4b, 4c 


Singlet 
5d 


Singlet 
6c 




Concordance 


1 =AA 


2 = AC 


3 = AG 


4 = CA 


5 = GA 


6-GG 


i&m 


Triplet & 
Doublet (lb) 


No overlap 


Singlet & 
Singlet (3a) 


Singlet & 
Doublet (4a) 


No overlap 


Indetenninant 


II & III 


Doublet & 
Doublet (le) 


Doublet & 
Doublet (2c) 


No overlap 


Doublet & 
Doublet (4c) 


No overlap 


No overlap 


III & IV 


Doublet & 
Singlet (le) 


Doublet & 
Doublet (2c) 


No overlap 


Doublet & 
Doublet (4c) 


No overlap 


Indeterminant 


II&IV 


Doublet & 
Singlet (le) 


Doublet & 
Doublet (2c, d) 


Doublet & 
Singlet (3b) 


Doublets 
Doublet (4b, c) 


Doublet & 
Singlet (5d) 


No overlap 


Discordance 














l&II 


la, b,c^ld, 
e 


2a, b ^ 2c, d 


3a ^ 3b, c 


4a * 4b, c 


5a, b, c ^ 5d, e 




I&IV 


la,b,c* le 


2a,b^2c,d 


3a^3b 


4a ?i 4b, c 


5a, b, c ;t 5d 


6a, b ?t 6c 



FIG. 44 
Summary of unique and overlapping DrdI islands in four hypothetical BAC clones:

Unique I          (1a, c), (2a, b), (5a, b, c)
Overlap I & III   1b, 3a, 4a
Unique II         1d, 5e, 3c
Overlap II & III  1e, 2c, 4c
Overlap II & IV   1e, (2c, d), 3b, (4b, c), 5d
Unique III        2e
Overlap I & III   1b, 3a, 4a
Overlap II & III  1e, 2c, 4c
Overlap III & IV  1e, 2c, 4c
Unique IV         (No unambiguous unique site)
Overlap II & IV   1e, (2c, d), 3b, (4b, c), 5d
Overlap III & IV  1e, 2c, 4c

Order of DrdI islands in four BAC clones.

{ (1a, c), (2a, b), (5a, b, c) } { 1b, 3a, 4a } { 2e } { 1e, 2c, 4c } { 2d, 3b, 4b, 5d } { 1d, 3c, 5e }



L 



m 



IV 



II 



5c 5b 2b 5a 6a 
2a la \ / Ic 3a\ 6b lb 4a 

11 ( l( I I )V 1 1 



III 



6a 

3a\ 6b lb 4a 



3a\ 61 

i 



6d 



.4c 



2c 



le 3b 



4c 



6c 2c 



le 



5d 4b 5e 
Id I Id 



IV 



4c 



6c 2c Ic 3b 2d/ 



5d- 4b 



FIG. 45 
DrdI, TaqI and MspI sites in overlapping BACs from 7q31
Contig 1941 (RG253B13, RG013N12, and RG300C03)

DrdI/MspI/TaqI





AG 


AC 


CA 


GA 


AA 


GG 


RG253B13 


546* 


502 


559* 


3.419* 


192* 


864 




353* 


381* 






597* 


4,918 
















RG013N12 


546* 


381* 


559* 


3.419* 


192* 






353* 


1.099 


359 




597* 






1.137t 




.16t 




2,040 














2,328t 


















RG300C03 


1,11377 


212 


16t 




2,328t 








L008 






224 














1,035 
































pBeloBacll 






141 


360 


66 












691 























CT 


GT 


TG 


TC 


TT 


CC 


RG253B13 


1620* 


4497 


754* 


915* 


1794* 


3641 




1631* 


50* 






296* 


.1866 
















RG013N12 


1620* 


50* 


754* 


915* 


1794* 






1631* 


7278 


1908 


811 


296* 






2077t 




183t 




525 














372t 


















RG300C03 


2077t 


282 


183t 




372t 














1227 














1103 
































pBeloBacil 






127 


238 


145 












199 





















RG253B13/RG013N12 = *   RG013N12/RG300C03 = †



FIG. 46 
DrdI, TaqI and MspI sites in overlapping BACs from 7q31

Contig T002144 (RG022J17, RG067E13, RG011J21, RG022C01, and RG043K06)

DrdI/MspI/TaqI




AG 


AC 


CA 


GA 


AA 


GG 


RG022J17 


1.215* 


563 




2,977 


933 












77* 


2.608 












142* 


71* 












4,502* 


492* 


















RG067E13 


1,215* 


2,001 1 




77* 


71* 












142* 


492* 












4,502* 




















RGOl 1J21 




2,001 1 




8 


6,019* 


3,66 U 






699 


235 






















RG022C0I 










6,019? 


3,6611 












2,043** 


















RG043K06 






2.127 


510 


2,043** 










39 




5,578 










4 






















pBeloBacll 






141 


360 


66 












691 







RG022J17/RG067E13 = *   RG067E13/RG011J21 = †   RG011J21/RG022C01 = ‡
RG022C01/RG043K06 = **



FIG. 46(cont.) 
DrdI/MspI/TaqI





CT 


GT 


TG 


TC 


TT 


CC 


RG022J17 


5335* 




1433 


328 


306 


6* 








6190 


1427* 


2216 












663* 


114* 












2311* 


1470* 


















RG067E13 


5335* 


571t 




1427* 


114* 


6* 










663* 


1470* 












2311* 




















RG011J21 


544$ 


571t 


4716 


4298 




2437* 






2399 


2156 






















RG022C01 


544$ 








549]** 


2437+ 






























RG043K06 






19 


3213 


5491** 










1510 




1981 










2821 






















pBeloBac 1 1 






127 


238 


145 












199 







RG022J17/RG067E13 = *   RG067E13/RG011J21 = †   RG011J21/RG022C01 = ‡
RG022C01/RG043K06 = **



FIG. 46(cont) 
DrdI, TaqI and MspI sites in overlapping BACs from 7q31

Contig T002149 (RG343P13, RG205G13, O68P20, and H_133K23)

DrdI/MspI/TaqI




AG 


AC 


CA 


GA 


AA 


GG 


RG343P13 






861 




416 






157* 




4 




426* 














52* 


















RG205G13 


157* 


396t 






426* 














52* 


















O68P20 


825 


396t 


155 


241t 


517 


7491: 








1,178 




119 










285 














2,758 














1,161+ 






















H_133K23 


5984 




l,161t 


241 + 




7491: 




804 








































pBeloBacl 1 






141 


360 


66 












691 







RG343P13/RG205G13 = *   RG205G13/O68P20 = †   O68P20/H_133K23 = ‡



FIG. 46(cont.) 
DrdI/MspI/TaqI





CT 


GT 


TG 


TC 


TT 


CC 


RG343P13 


1348 




4 


246 


144 






58* 








110* 














45* 


















RG205G13 


58* 








110* 














45* 


















O68P20 


1146 




61 


488* 


2438 


1567* 








4573 




394 










1456 














1774 














330t 






















H_133K23 






330$ 


488* 




1567* 










3335 














1181 




















pBeloBacll 






127 


238 


145 












199 







RG343P13/RG205G13 = *   RG205G13/O68P20 = †   O68P20/H_133K23 = ‡



FIG. 46(cont) 
DrdI and MseI sites in overlapping BACs from 7q31

Contig 1941 (RG253B13, RG013N12, and RG300C03)

DrdI/MseI





AG 


AC 


CA 


GA 


AA 


GG 


RG253B13 


546* 


203 


294 


36* 


687* 


32 




142* 


47* 








935 
















RG013N12 


546* 


47* 


404 


36* 


687* 






142* 


195 


277t 


103 


325 






39t 








24t 
































RG300C03 


39t 


132 


277t 




24t 








379 






190 














14 
































pBeloBacll 






87 


484 


344 












136 







RG253B13/RG013N12 = *   RG013N12/RG300C03 = †



FIG. 47 
DrdI, TaqI and MspI sites in overlapping BACs from 7q31

Contig T002144 (RG022J17, RG067E13, RG011J21, RG022C01, and RG043K06)

DrdI/MseI





AG 


AC 


CA 


GA 


AA 


GG 


RG022J17 


338* 


109 


134 


38 


19 


55* 










586* 


148 












77* 


273* 












17* 




















RG067E13 


338* 


71t 




586* 


273* 


55* 










77* 














17* 




















RG01IJ21 


92t 


71t 


276 


214 


48* 


42* 






30 


248 




































RG022C01 


92+ 








48* 


42* 






























RG043K06 






550 


59 


80 










77 














32 






















pBeloBacll 






87 


484 


344 












136 







RG022J17/RG067E13 = *   RG067E13/RG011J21 = †   RG011J21/RG022C01 = ‡
RG022C01/RG043K06 = **



FIG. 47(cont.) 



WO 00/40755   PCT/US00/00144



DrdI/MseI





CT 


GT 


TG 


TC 


TT 


CC 


RG022J17 


368* 




329 


70 


33 


163* 








186 


84* 


182 












36* 


296* 












57* 


59* 


















RG067E13 


368* 


161t 




84* 


296* 


163* 










36* 


59* 












57* 




















RG011J21 


4U 


161t 


45 


49 


270$ 


101$ 






46 


30 






















RG022C01 


41$ 








270$ 


101$ 












29** 


















RG043K06 






76 


12 


29** 










35 




65 










51 






















pBeloBacl 1 






46 


21 


420 












115 







RG022C01/ RG043K06 = ** 



FIG. 47(cont.) 
DrdI and MseI sites in overlapping BACs from 7q31

Contig T002149 (RG343P13, RG205G13, O68P20, and H_133K23)

DrdI/MseI









PA 


r; A 


A A 






lU /o 




jy r 














I 


















286* 


















RG205G13 


1076* 


89t 






648* 














286* 


















O68P20 


59 


89t 


134 


211: 


26 


1681 








62 




63 










22 














206$ 




































H_I33K23 


155 




206t 


21* 




168$ 




36 








































pBeloBacll 






87 


484 


344 












136 







RG343P13/RG205G13 = *   RG205G13/O68P20 = †   O68P20/H_133K23 = ‡



FIG. 47 (cont.) 
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41 






/J 








CO* 




oil 




































RG205G13 


53* 


51t 
RG343P13/RG205G13 = *   RG205G13/O68P20 = †   O68P20/H_133K23 = ‡

RG253B13, 7q31 Met Oncogene

12 DrdI, 86 MspI and 62 TaqI Sites in 171,905 bp



Drd I 
Mspi 
Taq I 



1 


25000 50000 


75000 


100000 

f 


125000 


150000 

1 1 




12 ! ; ! 1 1 I 1 II i 1 i 


86 1 


ii: ; ;i !- Hi in H ! 


! lii! I 


i fl ! : 


i M ii 




62 i 


: !i ! : : ! Ii i i: li 


( Mi! 






: : !! 1 1 ii . 



Sequence 




:BAC RG253B13.seq ( 1 


> 171905 ) 


160 Cut Sites 




5385 


Drd 


I 


81884 


Taq I 


99370 


Taq 


I 


6249 


Taq 


I 


84572 


Msp I 


99513 


Msp 


I 


6381 


Msp 


T 


84594 


Msp I 


99628 


Drd 


T 








84831 


Msp I 








21540 


Taq 


I 


85041 


Msp I 


101248 


Taq 


I 


22165 


Msp 


I 


85105 


Msp I 


10144C 


Drd 


T 


25228 


Msp 


I 


85155 


Msp I 








26363 


Msp 


I 


85212 


Msp I 


113048 


Drd 


I 


26871 


Drd 


I 


85523 


Msp I 


113429 


Taq 


I 








85538 


Msp I 


118458 


Taq 


I 


33306 


Drd 


I 


85569 


Msp I 


120734 


Taq 


I 


38218 


Msp 


I 


85625 


Msp I 


122429 


Msp 


I 


39316 


Msp 


I 


85655 


Msp I 








40280 


Msp 


I 


85670 


Msp I 


135890 


Msp 


I 


40344 


Msp 


I 


86173 


Msp I 


136580 


Taq 


I 


40389 


Taq 


I 


88256 


Msp I 


137177 


Drd 


I 








91681 


Drd I 








45534 


Drd 


I 






159685 


Drd 


T 








96506 


Drd I 


160038 


Msp 


I 


70528 


Drd 


T 


97059 


Taq I 


160434 


Msp 










98602 


Msp I 


161212 


Taq 


I 



For AA, AC, AG, CA, GA, and GG overhangs

DrdI#  Location  Overhang  Complement  Nearest MspI  Nearest TaqI  Fragment Length
1.     5,379     GG*       CC          6,381         6,249         864
2.     26,865    GT        AC*         26,363        21,540        502
3.     33,300    GG*       CC          38,218        40,389        4,918
4.     45,528    AT        AT
5.     70,522    AT        AT
6.     91,675    TC        GA*         88,256        81,884        3,419
7.     96,500    CA*       TG          98,602        97,059        559
8.     99,622    CT        AG*         99,513        99,370        115
9.     101,434   TT        AA*                       101,248       192
10.    113,042   AC*       GT          122,429       113,429       381
11.    137,171   TT        AA*         135,890       136,580       597
12.    159,679   AG*       CT          160,038       161,212       353



* To obtain sequence information on AA, AC, AG, CA, GA, or GG overhangs in the sense
direction, the DrdI island is amplified using a downstream MspI or TaqI site. For such two base
sequences on the complementary strand, the DrdI island is amplified using an upstream MspI or
TaqI site.
PCR amplification of DrdI representation for shotgun cloning
and generating mapped SNPs.



1. Cut genomic DNAs with MspI, TaqI and DrdI in the presence of linkers and T4 ligase. Linker for DrdI site is phosphorylated, methylated at internal XmaI site, and contains a 3' AA overhang. Linker for MspI site is not phosphorylated, methylated at internal XhoI site, and contains a bubble. Biochemical selection assures that most sites contain linkers.
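The DrdI recognition sequence drawn in this figure, GACNNNNNNGTC, can be located in a sequence with a short scan. The sketch below is illustrative only: it assumes DrdI cuts as GACNNNN^NNGTC, so that the 2-base 3' overhang is the third and fourth degenerate bases, which agrees with the AA linker joined to NNGTC in the figure. The function name and the test sequence are hypothetical.

```python
import re

# Lookahead so that overlapping recognition sites are all reported.
_DRDI = re.compile(r"(?=GAC([ACGT]{6})GTC)")

def drdi_overhangs(seq):
    """Return (1-based site start, 2-base 3' overhang) for each DrdI site.

    Assumes the cut GACNNNN^NNGTC, which leaves the middle two of the
    six degenerate bases as the 3' overhang on the top strand.
    """
    seq = seq.upper()
    hits = []
    for match in _DRDI.finditer(seq):
        six_n = match.group(1)
        hits.append((match.start() + 1, six_n[2:4]))
    return hits

# Hypothetical sequence with one site whose overhang is AA.
print(drdi_overhangs("TTGACAAAACCGTCTT"))
```

Sites whose overhang matches the linker (AA here) would be the ones carried into the representation; the other overhangs would need linkers with the corresponding 3' ends.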



Dnn 
I 

" GACNNNNNNGTC " 
■ CTGNNNNNNCAG- 

1 



Msp\ 
I 

' CCGG 
' GGCC 

1 



■ cccgggaataanngtc" 
gggccctt;^ttnncag - 



CCGT— 3' 

GAGCTC 
m 



2. Inactivate T4 ligase and restriction endonucleases at 95°C for 5 min. PCR amplify using unmethylated primers, dNTPs, and Taq polymerase. Conditions are optimized to amplify about 35,000 fragments at high yield while minimizing bias in the representation.



5 . 

3'- 



Xmal 

i 

' CCCGGGAATAANNGTC- 
GGGCCCTTATTNNCAG - 

t 



CCGT- 
• GGCA- 



Xho\ 



CTCGAG — 3* 
. GAGCTC — 



3. Cut PCR products with XmaI and XhoI, separate mixed fragments on an agarose gel, select and purify 200-1,000 bp fragments and clone into the corresponding sites of a standard vector. Sequence inserts to build mapped SNP database.



CCCGGGAATAANNGTC - 
GGGCCCTTATTNNCAG - 



CCGT- 
GGCA- 



CTCGAG ' 
GAGCJC ' 
PCR amplification of DrdI representation for shotgun cloning
and generating mapped SNPs.



1. Cut genomic DNAs with MspI, TaqI and DrdI in the presence of linkers and T4 ligase. Linker for DrdI site is phosphorylated, methylated at internal XmaI site, and contains a 3' AA overhang. Linker for MspI site is phosphorylated, 3' blocked, methylated at internal XhoI site, and contains a bubble. Biochemical selection assures that most sites contain linkers.



2. Inactivate T4 ligase and 
restriction endonucleases at 
^5°Cfor5min. PCR 
amplify using unmethylated 
primers. dNTPs, and 
Tag polymerase. 
Conditions are optimized to 
amplify about 35,000 
fragments at high yield 
while minimizing bias in 
the representation. 



Drd\ 
4 



• GACNNNNNNGTC- 
■ CTGNNNNNNCAG- 
1 



5 — CCCGGGAATAA 

3' GGGCCCTTA 
m 



Mspl 
i 

•CCGG 
■ GGCC 

1 




5' 
3*' 



. AA 



- cccgggaataanngtc- 
■ gggccctt;^ttnncag - 




Xmal 



Xhol 



5*. 
3'- 



• cccgggaataanngtc- 
gggcccttattnncag - 



• ccgt— ctcgag — 3' 
■ ggca — gagctc — c- 

T 



1 



3, Cut PCR products with 

Xmal and XhoU separate • ^^^^ , » „ » , ™««^ ^^.^.m Zr^^^^ 

. *^ CCCGGGAATAANNGTC ccgt CTCGAG- 

mixed fragments on an ggGCCCTTATTNNCAG GGCA — GAGCTC - 

agarose gel, select and 

purify 200-1,000 bp 

fragments and clone into 

the corresponding sites of a 

standard vector. Sequence 

inserts to build mapped 

SNP database. 
PCR amplification of DrdI representation for high-throughput SNP detection.

1. Cut genomic DNA with MspI, TaqI and DrdI in the presence of linkers and T4 ligase. Linker for DrdI site is phosphorylated and contains a 3' AA overhang. Linker for MspI site is not phosphorylated, and contains a bubble. Biochemical selection assures that most sites contain linkers.

2. Inactivate T4 ligase and restriction endonucleases at 95°C for 5 min. PCR amplify using DrdI primer containing a 3' AAC overhang, dNTPs, and Taq polymerase. Conditions are optimized to amplify about 9,000 fragments at high yield while minimizing bias in the representation.

3. Add LDR primers and thermostable ligase to simultaneously detect SNPs at multiple loci. In (A) the common LDR primer contains zip-code Z1, the discriminating primers contain fluorescent labels F1 and F2; after array capture, ratio of F1/F2 determines presence of allele or allele imbalance. In (B) the common LDR primer contains fluorescent label F, the discriminating primers contain zip-codes Z2 and Z3; after array capture, ratio of fluorescence at Z2 and Z3 determines presence of allele or allele imbalance.

[Diagram: DrdI site 5'-GACNNNNNNGTC-3' / 3'-CTGNNNNNNCAG-5' and MspI site 5'-CCGG-3' / 3'-GGCC-5'. After linker ligation the fragment reads 5'-CTAATAANNGTC ... CCGT-3' / 3'-GATTATTNNCAG ... GGCA-5'; the sequence shown for the LDR step is 5'-CTAATAACNGTC-3' / 3'-GATTATTGNCAG-5', with a G or A at the SNP position.]
PCR amplification of DrdI representation for high-throughput SNP detection.

1. Cut genomic DNA with MspI, TaqI and DrdI in the presence of linkers and T4 ligase. Linker for DrdI site is phosphorylated and contains a 3' AA overhang. Linker for MspI site is phosphorylated, 3' blocked and contains a bubble. Biochemical selection assures that most sites contain linkers.

2. Inactivate T4 ligase and restriction endonucleases at 95°C for 5 min. PCR amplify using DrdI primer containing a 3' AAC overhang, dNTPs, and Taq polymerase. Conditions are optimized to amplify about 9,000 fragments at high yield while minimizing bias in the representation.

3. Add LDR primers and thermostable ligase to simultaneously detect SNPs at multiple loci. In (A) the common LDR primer contains zip-code Z1, the discriminating primers contain fluorescent labels F1 and F2; after array capture, ratio of F1/F2 determines presence of allele or allele imbalance. In (B) the common LDR primer contains fluorescent label F, the discriminating primers contain zip-codes Z2 and Z3; after array capture, ratio of fluorescence at Z2 and Z3 determines presence of allele or allele imbalance.

[Diagram: DrdI site 5'-GACNNNNNNGTC-3' / 3'-CTGNNNNNNCAG-5' and MspI site 5'-CCGG-3' / 3'-GGCC-5'. After linker ligation the fragment reads 5'-CTAATAANNGTC ... CCGT-3' / 3'-GATTATTNNCAG ... GGCA-5'; the sequence shown for the LDR step is 5'-CTAATAACNGTC-3' / 3'-GATTATTGNCAG-5', with a G or A at the SNP position.]
[Figure: two plot panels, each with x-axis "DNA Template (fmol)".]
PCR/LDR with Addressable Array Capture

[Figure, panel A. Recoverable labels: 12, 13, 61; scale marks 200, 400, 600, 800.]

K-ras 12 (wild-type codon GGT, complementary strand CCA)

Amino acid    Address
Cys           Zip 21
Arg           Zip 15
Ser           Zip 13
Val           Zip 11
Ala           Zip 5
Asp           Zip 3
Wt            Zip 1

K-ras 13 (wild-type codon GGC, complementary strand CCG)

Amino acid    Address
Wt            Zip 25
Asp           Zip 23
[Figure: array capture images, one panel per sample. Panel labels (Mutation): Wt (G12), G12D, G12A, G12V, G12S, G12R, G12C, G13D, Wt (G13); the Wt address is also marked in the mutant panels.]
[Figure: LDR product peaks for normal-to-mutant template ratios 1:1, 1:2, 1:3, 1:4, 1:6 and 1:1, 2:1, 3:1, 4:1, 6:1.]

Ratio of Normal to    LDR Product (fmol)     Ratio of LDR Products
Mutant Template       Normal     Mutant      Absolute    Normalized

1:1                   32.2       51.7        0.62        1:1.0
1:2                   11.8       41.9        0.28        1:2.2
1:3                   13.7       64.2        0.21        1:3.0
1:4                   12.8       78.4        0.16        1:3.9
1:6                    6.5       70.2        0.09        1:6.7

1:1                   32.2       51.7        0.62        1.0:1
2:1                   41.6       33.1        1.26        2.0:1
3:1                   34.1       18.5        1.84        3.0:1
4:1                   42.7       18.1        2.36        3.8:1
6:1                   64.4       18.4        3.50        5.7:1
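The normalized column can be reproduced from the fmol values by dividing each absolute ratio by the absolute ratio measured for the 1:1 template mix. The sketch below is illustrative only; the function and variable names are hypothetical, and using the 1:1 row as the reference is an assumption inferred from the tabulated values.

```python
def absolute_ratio(normal_fmol, mutant_fmol):
    """Absolute ratio of LDR products: normal product over mutant product."""
    return normal_fmol / mutant_fmol


def normalized_fold(normal_fmol, mutant_fmol, ref_normal=32.2, ref_mutant=51.7):
    """Absolute ratio divided by the ratio observed for the 1:1 template mix.

    A value above 1 means excess normal template (reported as fold:1),
    a value below 1 means excess mutant template (reported as 1:(1/fold)).
    """
    reference = absolute_ratio(ref_normal, ref_mutant)
    return absolute_ratio(normal_fmol, mutant_fmol) / reference


# Rows taken from the table: (template ratio, normal fmol, mutant fmol)
rows = [("1:2", 11.8, 41.9), ("2:1", 41.6, 33.1)]
for label, normal, mutant in rows:
    fold = normalized_fold(normal, mutant)
    if fold >= 1:
        print(f"{label}: {fold:.1f}:1")
    else:
        print(f"{label}: 1:{1 / fold:.1f}")
```

Normalizing to the 1:1 sample removes the constant difference in yield between the two discriminating primers, so the remaining ratio tracks the template ratio.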
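PCR/LDR with Addressable Array Capture

A. LDR reaction: Zip codes on discriminating primers.
[Diagram: discriminating primers carrying zip codes Z1 and Z2 and a labeled common primer annealed at a template position that is A or G; products captured at array addresses Z1 and Z2.]
Heterozygous: C and T alleles.

B. LDR reaction: Zip code on common primer.
[Diagram: discriminating primers labeled F1 and F2 and a common primer carrying the zip code, annealed at a template position that is A or G; F1 (C) and F2 (T) products captured at one address.]
Heterozygous: C and T alleles.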
SNuPE with Addressable Array Capture

Zip code on upstream primer, labeled with F3. Extend with labeled dideoxynucleotides.
[Diagram: primer extended with ddC opposite a template position that is A or G.]
Heterozygous: C and T alleles.
PCR/LDR with Gene Array Capture

LDR reaction: Small percentage of common primer labeled with F3.
[Diagram: discriminating primers labeled F1 (C) and F2 (T) annealed at a template position that is A or G; captured products carry F1, F2 and F3.]
Heterozygous: C and T alleles.

LDR reaction: Product is complementary to array.
[Diagram: discriminating primers labeled F1 (C) and F2 (T) annealed at a template position that is A or G.]
Heterozygous: C and T alleles.
LDR/PCR with Addressable Array Capture

1. LDR reaction: Universal primer and unique Zip codes on 5' side of discriminating primers, universal primer on 3' side of common primer.

2. PCR reaction: Universal primers amplify multiplex LDR products simultaneously. One primer is fluorescently labeled.

3. Capture: Fluorescently labeled products are captured on addressable array at unique zipcode sequences.

[Diagram: discriminating primers with zip codes Z1 (C) and Z2 (T) at a template position that is A or G; F1-labeled products captured at addresses Z1 and Z2.]
Heterozygous: C and T alleles.
LDR/PCR with Addressable Array Capture

1. LDR reaction: Universal primer and unique Zip codes on 5' side of discriminating primers, universal primer on 3' side of common primer.

2. PCR reaction: Universal primers amplify multiplex LDR products simultaneously. One primer is fluorescently labeled, while the other contains a 5' phosphate. After PCR amplification, the phosphorylated strand is digested with lambda exonuclease leaving fluorescently labeled single-stranded DNA.

3. Capture: Fluorescently labeled products are captured on addressable array at unique zipcode sequences.

[Diagram: discriminating primers with zip codes Z1 and Z2 at a template position that is A or G; F1-labeled products captured at addresses Z1 and Z2.]
Heterozygous: C and T alleles.
PCR/LDR with Addressable Array Capture

A. LDR reaction: Zip codes on discriminating primers, small percentage labeled with F2.
[Diagram: F1-labeled products for the C and T alleles at a template position that is A or G.]
Heterozygous: C and T alleles.

B. LDR reaction: Zip code on common primer, small percentage labeled with F3.
[Diagram: discriminating primers labeled F1 (C) and F2 (T), common primer carrying zip code Z3, at a template position that is A or G.]
Heterozygous: C and T alleles.

FIG. 62

SUBSTITUTE SHEET (RULE 26)

WO 00/40755    91/103    PCT/US00/00144



PCR/LDR with Addressable Array Capture: Detection of gene
amplification using zip codes on the discriminating primers.

[Diagram: F1-labeled LDR products captured at the zip-code address of each allele.]

Tumor sample with 50% stromal contamination:
A. Tumor gene alleles: Ratio of C to T alleles = 10 / 4 = 2.5
B. Control gene alleles: Ratio of G to A alleles = 4 / 4 = 1.0

Normal sample with allele balance:
C. Tumor gene alleles: Ratio of C to T alleles = 4 / 4 = 1.0
D. Control gene alleles: Ratio of G to A alleles = 4 / 4 = 1.0

Ratio of tumor to control allele / normal to control allele:
C : G Tumor / C : G Normal = (10 / 4) / (4 / 4) = 2.5
T : A Tumor / T : A Normal = (4 / 4) / (4 / 4) = 1.0

FIG. 63






PCR/LDR with Addressable Array Capture: Detection of gene
amplification using zip codes on the common primers.

[Diagram: discriminating primers labeled F1 and F2; common primers carry zip code Z3 (tumor gene, template position A or G) and Z4 (control gene, template position T or C).]

Tumor sample with 50% stromal contamination:
A. Tumor gene alleles: Ratio of C to T alleles = 10 / 4 = 2.5
B. Control gene alleles: Ratio of G to A alleles = 4 / 4 = 1.0

Normal sample with allele balance:
C. Tumor gene alleles: Ratio of C to T alleles = 4 / 4 = 1.0
D. Control gene alleles: Ratio of G to A alleles = 4 / 4 = 1.0

Ratio of tumor to control allele / normal to control allele:
C : G Tumor / C : G Normal = (10 / 4) / (4 / 4) = 2.5
T : A Tumor / T : A Normal = (4 / 4) / (4 / 4) = 1.0
PCR/LDR with Addressable Array Capture: Detection of loss of
heterozygosity using zip codes on the discriminating primers.

[Diagram: discriminating primers carry zip codes Z1 and Z2 (tumor gene, template position A or G) and Z3 and Z4 (control gene, template position T or C); F1-labeled products captured at each address.]

Tumor sample with 40% stromal contamination:
A. Tumor gene alleles: Ratio of C to T alleles = 5 / 2 = 2.5
B. Control gene alleles: Ratio of G to A alleles = 5 / 5 = 1.0

Normal sample with allele balance:
C. Tumor gene alleles: Ratio of C to T alleles = 5 / 5 = 1.0
D. Control gene alleles: Ratio of G to A alleles = 5 / 5 = 1.0

Ratio of tumor to control allele / normal to control allele:
C : G Tumor / C : G Normal = (5 / 5) / (5 / 5) = 1.0
T : A Tumor / T : A Normal = (2 / 5) / (5 / 5) = 0.4

FIG. 65



PCR/LDR with Addressable Array Capture: Detection of loss of
heterozygosity using zip codes on the common primers.

[Diagram: discriminating primers labeled F1 and F2; common primers carry the zip codes (tumor gene, template position A or G; control gene, template position T or C).]

Tumor sample with 40% stromal contamination:
A. Tumor gene alleles: Ratio of C to T alleles = 5 / 2 = 2.5
B. Control gene alleles: Ratio of G to A alleles = 5 / 5 = 1.0

Normal sample with allele balance:
C. Tumor gene alleles: Ratio of C to T alleles = 5 / 5 = 1.0
D. Control gene alleles: Ratio of G to A alleles = 5 / 5 = 1.0

Ratio of tumor to control allele / normal to control allele:
C : G Tumor / C : G Normal = (5 / 5) / (5 / 5) = 1.0
T : A Tumor / T : A Normal = (2 / 5) / (5 / 5) = 0.4
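The allele ratios quoted for the contaminated tumor samples are consistent with a simple mixture of tumor and stromal cells. The sketch below is an illustration under stated assumptions, not a model given in the figures: stromal cells are assumed to carry one copy of each allele, and the tumor copy numbers used (four copies of the C allele for amplification, zero copies of the T allele for loss of heterozygosity) are assumptions chosen because they reproduce the 10:4 and 5:2 ratios. The function name is hypothetical.

```python
def mixed_allele_levels(stromal_fraction, tumor_copies_c, tumor_copies_t):
    """Relative C and T allele levels in a tumor/stroma mixture.

    Assumption: every stromal cell contributes one C and one T copy;
    every tumor cell contributes the given copy numbers.
    """
    tumor_fraction = 1.0 - stromal_fraction
    c_level = stromal_fraction * 1 + tumor_fraction * tumor_copies_c
    t_level = stromal_fraction * 1 + tumor_fraction * tumor_copies_t
    return c_level, t_level


# Gene amplification, 50% stroma, C allele amplified four-fold in tumor cells.
amp_c, amp_t = mixed_allele_levels(0.5, tumor_copies_c=4, tumor_copies_t=1)
amp_ratio = amp_c / amp_t   # C to T ratio, compare with 10 / 4

# Loss of heterozygosity, 40% stroma, T allele lost in tumor cells.
loh_c, loh_t = mixed_allele_levels(0.4, tumor_copies_c=1, tumor_copies_t=0)
loh_ratio = loh_c / loh_t   # C to T ratio, compare with 5 / 2
```

Both cases give a C to T ratio of 2.5, which is why the comparison against the control gene is needed to tell a gain of C from a loss of T.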
Detection of gene amplification in tumor samples which contain stromal contamination using
zip codes on the discriminating primers.

Tumor sample contains 10,000 tumor gene C alleles, and 4,000 tumor gene T alleles.

F1 for C allele (60% of 10,000)(45% capture at Z1) / F2 (10% of 100,000)(45% capture at Z1)
F1 for C allele (= 2,700) / F2 (= 4,500)
F1 for C allele / F2 = 0.60

F1 for T allele (40% of 4,000)(30% capture at Z2) / F2 (10% of 100,000)(30% capture at Z2)
F1 for T allele (= 480) / F2 (= 3,000)
F1 for T allele / F2 = 0.16

Normal sample contains 4,000 tumor gene C alleles, and 4,000 tumor gene T alleles.

F1 for C allele (60% of 4,000)(35% capture at Z1) / F2 (10% of 100,000)(35% capture at Z1)
F1 for C allele (= 840) / F2 (= 3,500)
F1 for C allele / F2 = 0.24

F1 for T allele (40% of 4,000)(50% capture at Z2) / F2 (10% of 100,000)(50% capture at Z2)
F1 for T allele (= 800) / F2 (= 5,000)
F1 for T allele / F2 = 0.16

Tumor sample contains 4,000 control gene G alleles, and 4,000 control gene A alleles.

F1 for G allele (45% of 4,000)(40% capture at Z3) / F2 (10% of 100,000)(40% capture at Z3)
F1 for G allele (= 720) / F2 (= 4,000)
F1 for G allele / F2 = 0.18

F1 for A allele (55% of 4,000)(60% capture at Z4) / F2 (10% of 100,000)(60% capture at Z4)
F1 for A allele (= 1,320) / F2 (= 6,000)
F1 for A allele / F2 = 0.22

Normal sample contains 4,000 control gene G alleles, and 4,000 control gene A alleles.

F1 for G allele (45% of 4,000)(55% capture at Z3) / F2 (10% of 100,000)(55% capture at Z3)
F1 for G allele (= 990) / F2 (= 5,500)
F1 for G allele / F2 = 0.18

F1 for A allele (55% of 4,000)(45% capture at Z4) / F2 (10% of 100,000)(45% capture at Z4)
F1 for A allele (= 990) / F2 (= 4,500)
F1 for A allele / F2 = 0.22

C : G Tumor / C : G Normal = (0.60 / 0.18) / (0.24 / 0.18) = 2.5
T : A Tumor / T : A Normal = (0.16 / 0.22) / (0.16 / 0.22) = 1
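The arithmetic above can be checked with a short script. This is a minimal sketch, assuming the figure's bookkeeping: the F1 signal is the allele count times an allele-specific yield times the capture efficiency at the address, and the F2 signal is 10% of 100,000 times the same capture efficiency. The names `signal_ratio`, `allele_yield` and `capture` are hypothetical labels for the percentages in the figure.

```python
def signal_ratio(alleles, allele_yield, capture,
                 f2_pool=100_000, f2_fraction=0.10):
    """F1/F2 signal ratio at one array address.

    alleles      -- number of template alleles in the sample
    allele_yield -- fraction of alleles giving F1 product (e.g. 0.60)
    capture      -- capture efficiency at the address (e.g. 0.45)
    """
    f1 = alleles * allele_yield * capture
    f2 = f2_pool * f2_fraction * capture
    return f1 / f2


# Tumor gene C allele and control gene G allele, tumor then normal sample.
tumor_c = signal_ratio(10_000, 0.60, 0.45)
tumor_g = signal_ratio(4_000, 0.45, 0.40)
normal_c = signal_ratio(4_000, 0.60, 0.35)
normal_g = signal_ratio(4_000, 0.45, 0.55)

# Ratio of tumor to control allele over normal to control allele.
amplification = (tumor_c / tumor_g) / (normal_c / normal_g)
print(round(amplification, 2))
```

The capture efficiency differs at every address and between samples, yet it appears in both F1 and F2 and so drops out of each F1/F2 ratio.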
Detection of gene amplification in tumor samples which contain stromal contamination using zip codes on
the common primers.

Tumor sample contains 10,000 tumor gene C alleles, and 4,000 tumor gene T alleles.

F1 for C allele (60% of 10,000)(55% capture at Z3) / F2 (10% of 100,000)(55% capture at Z3)
F1 for C allele (= 3,300) / F2 (= 5,500)
F1 for C allele / F2 = 0.60

F1 for T allele (40% of 4,000)(55% capture at Z3) / F2 (10% of 100,000)(55% capture at Z3)
F1 for T allele (= 880) / F2 (= 5,500)
F1 for T allele / F2 = 0.16

Normal sample contains 4,000 tumor gene C alleles, and 4,000 tumor gene T alleles.

F1 for C allele (60% of 4,000)(60% capture at Z3) / F2 (10% of 100,000)(60% capture at Z3)
F1 for C allele (= 1,440) / F2 (= 6,000)
F1 for C allele / F2 = 0.24

F1 for T allele (40% of 4,000)(60% capture at Z3) / F2 (10% of 100,000)(60% capture at Z3)
F1 for T allele (= 960) / F2 (= 6,000)
F1 for T allele / F2 = 0.16

Tumor sample contains 4,000 control gene G alleles, and 4,000 control gene A alleles.

F1 for G allele (45% of 4,000)(35% capture at Z4) / F2 (10% of 100,000)(35% capture at Z4)
F1 for G allele (= 630) / F2 (= 3,500)
F1 for G allele / F2 = 0.18

F1 for A allele (55% of 4,000)(35% capture at Z4) / F2 (10% of 100,000)(35% capture at Z4)
F1 for A allele (= 770) / F2 (= 3,500)
F1 for A allele / F2 = 0.22

Normal sample contains 4,000 control gene G alleles, and 4,000 control gene A alleles.

F1 for G allele (45% of 4,000)(30% capture at Z4) / F2 (10% of 100,000)(30% capture at Z4)
F1 for G allele (= 540) / F2 (= 3,000)
F1 for G allele / F2 = 0.18

F1 for A allele (55% of 4,000)(30% capture at Z4) / F2 (10% of 100,000)(30% capture at Z4)
F1 for A allele (= 660) / F2 (= 3,000)
F1 for A allele / F2 = 0.22

C : G Tumor / C : G Normal = (0.60 / 0.18) / (0.24 / 0.18) = 2.5
T : A Tumor / T : A Normal = (0.16 / 0.22) / (0.16 / 0.22) = 1
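The reason the final ratios do not depend on the capture efficiencies can be written out symbolically. The notation below is introduced here for illustration and does not appear in the figures: N is the number of alleles, y the allele-specific yield (60%, 40%, 45% or 55% above), c the capture efficiency at the address, p the labeled fraction (10%) and M the pool size (100,000).

```latex
\[
\frac{F_1}{F_2} \;=\; \frac{y\,N\,c}{p\,M\,c} \;=\; \frac{y\,N}{p\,M}
\]
\[
\frac{(C:G)_{\text{tumor}}}{(C:G)_{\text{normal}}}
\;=\;
\frac{\dfrac{y_C N_C^{\text{tumor}}}{y_G N_G^{\text{tumor}}}}
     {\dfrac{y_C N_C^{\text{normal}}}{y_G N_G^{\text{normal}}}}
\;=\;
\frac{N_C^{\text{tumor}} / N_G^{\text{tumor}}}
     {N_C^{\text{normal}} / N_G^{\text{normal}}}
\;=\;
\frac{10{,}000 / 4{,}000}{4{,}000 / 4{,}000} \;=\; 2.5
\]
```

The capture efficiency cancels within each F1/F2 ratio, and the allele-specific yields cancel when the tumor ratio is divided by the normal ratio, leaving only the allele counts.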
Detection of loss of heterozygosity (LOH) in tumor samples which contain stromal contamination using
zip codes on the discriminating primers.

Tumor sample contains 5,000 tumor gene C alleles, and 2,000 tumor gene T alleles.

F1 for C allele (60% of 5,000)(35% capture at Z1) / F2 (10% of 100,000)(35% capture at Z1)
F1 for C allele (= 1,050) / F2 (= 3,500)
F1 for C allele / F2 = 0.30

F1 for T allele (40% of 2,000)(55% capture at Z2) / F2 (10% of 100,000)(55% capture at Z2)
F1 for T allele (= 440) / F2 (= 5,500)
F1 for T allele / F2 = 0.08

Normal sample contains 5,000 tumor gene C alleles, and 5,000 tumor gene T alleles.

F1 for C allele (60% of 5,000)(30% capture at Z1) / F2 (10% of 100,000)(30% capture at Z1)
F1 for C allele (= 900) / F2 (= 3,000)
F1 for C allele / F2 = 0.30

F1 for T allele (40% of 5,000)(40% capture at Z2) / F2 (10% of 100,000)(40% capture at Z2)
F1 for T allele (= 800) / F2 (= 4,000)
F1 for T allele / F2 = 0.20

Tumor sample contains 5,000 control gene G alleles, and 5,000 control gene A alleles.

F1 for G allele (45% of 5,000)(45% capture at Z3) / F2 (10% of 100,000)(45% capture at Z3)
F1 for G allele (= 1,012) / F2 (= 4,500)
F1 for G allele / F2 = 0.22

F1 for A allele (55% of 5,000)(50% capture at Z4) / F2 (10% of 100,000)(50% capture at Z4)
F1 for A allele (= 1,375) / F2 (= 5,000)
F1 for A allele / F2 = 0.27

Normal sample contains 5,000 control gene G alleles, and 5,000 control gene A alleles.

F1 for G allele (45% of 5,000)(30% capture at Z3) / F2 (10% of 100,000)(30% capture at Z3)
F1 for G allele (= 675) / F2 (= 3,000)
F1 for G allele / F2 = 0.22

F1 for A allele (55% of 5,000)(60% capture at Z4) / F2 (10% of 100,000)(60% capture at Z4)
F1 for A allele (= 1,650) / F2 (= 6,000)
F1 for A allele / F2 = 0.27

C : G Tumor / C : G Normal = (0.30 / 0.22) / (0.30 / 0.22) = 1
T : A Tumor / T : A Normal = (0.08 / 0.27) / (0.20 / 0.27) = 0.4
Detection of loss of heterozygosity (LOH) in tumor samples which contain stromal contamination using
zip codes on the common primers.

Tumor sample contains 5,000 tumor gene C alleles, and 2,000 tumor gene T alleles.

F1 for C allele (60% of 5,000)(60% capture at Z3) / F2 (10% of 100,000)(60% capture at Z3)
F1 for C allele (= 1,800) / F2 (= 6,000)
F1 for C allele / F2 = 0.30

F1 for T allele (40% of 2,000)(60% capture at Z3) / F2 (10% of 100,000)(60% capture at Z3)
F1 for T allele (= 480) / F2 (= 6,000)
F1 for T allele / F2 = 0.08

Normal sample contains 5,000 tumor gene C alleles, and 5,000 tumor gene T alleles.

F1 for C allele (60% of 5,000)(55% capture at Z3) / F2 (10% of 100,000)(55% capture at Z3)
F1 for C allele (= 1,650) / F2 (= 5,500)
F1 for C allele / F2 = 0.30

F1 for T allele (40% of 5,000)(55% capture at Z3) / F2 (10% of 100,000)(55% capture at Z3)
F1 for T allele (= 1,100) / F2 (= 5,500)
F1 for T allele / F2 = 0.20

Tumor sample contains 5,000 control gene G alleles, and 5,000 control gene A alleles.

F1 for G allele (45% of 5,000)(40% capture at Z4) / F2 (10% of 100,000)(40% capture at Z4)
F1 for G allele (= 900) / F2 (= 4,000)
F1 for G allele / F2 = 0.22

F1 for A allele (55% of 5,000)(40% capture at Z4) / F2 (10% of 100,000)(40% capture at Z4)
F1 for A allele (= 1,100) / F2 (= 4,000)
F1 for A allele / F2 = 0.27

Normal sample contains 5,000 control gene G alleles, and 5,000 control gene A alleles.

F1 for G allele (45% of 5,000)(45% capture at Z4) / F2 (10% of 100,000)(45% capture at Z4)
F1 for G allele (= 1,012) / F2 (= 4,500)
F1 for G allele / F2 = 0.22

F1 for A allele (55% of 5,000)(45% capture at Z4) / F2 (10% of 100,000)(45% capture at Z4)
F1 for A allele (= 1,237) / F2 (= 4,500)
F1 for A allele / F2 = 0.27

C : G Tumor / C : G Normal = (0.30 / 0.22) / (0.30 / 0.22) = 1
T : A Tumor / T : A Normal = (0.08 / 0.27) / (0.20 / 0.27) = 0.4



